HPC Operations Engineer
Lambda
Lambda, The Superintelligence Cloud, is a leader in AI cloud infrastructure serving tens of thousands of customers. Our customers range from AI researchers to enterprises and hyperscalers. Lambda's mission is to make compute as ubiquitous as electricity and give everyone the power of superintelligence. One person, one GPU. If you'd like to build the world's best AI cloud, join us. *Note: This position requires presence in our San Francisco / Bellevue office location 4 days per week; Lambdaβs designated work from home day is currently Tuesday. Engineering at Lambda is responsible for building and scaling our cloud offering. Our scope includes the Lambda website, cloud APIs and systems as well as internal tooling for system deployment, management and maintenance. What Youβll Do - Remotely deploy and configure large-scale HPC clusters for AI workloads (up to many thousands of nodes) - Remotely install and configure operating systems, firmware, software, and networking on HPC clusters both manually and using automation tools - Troubleshoot and resolve HPC cluster issues working closely with physical deployment teams on-site - Provide clear and detailed requirements back to other engine... Click Apply to read the full job description.