Summary
The job is a remote position for an HPC/AI cluster deployment and configuration specialist at Lambda. The role involves deploying and configuring large-scale HPC clusters, troubleshooting issues, providing clear updates to project leads, staying updated on the latest HPC/AI technologies, and contributing to Standard Operating Procedures.
Requirements
- Have a good understanding of HPC/AI architecture, operating systems, firmware, software, and networking
- Have 3+ years of experience in deploying and configuring HPC clusters for AI workloads
- Have an innate attention to detail
- Be familiar with Bright Cluster Manager or similar cluster management tools
Responsibilities
- Remotely deploy and configure large-scale HPC clusters for AI workloads (up to many thousands of nodes)
- Remotely install and configure operating systems, firmware, software, and networking on HPC clusters, both manually and with automation tools
- Troubleshoot and resolve HPC cluster issues, working closely with on-site physical deployment teams
- Provide context and details to an automation team to further automate the deployment process
- Provide clear and detailed requirements back to the HPC design team on gaps and improvement areas, specifically in simplification, stability, and operational efficiency
- Contribute to the creation and maintenance of Standard Operating Procedures
- Provide regular and well-communicated updates to project leads throughout each deployment
Preferred Qualifications
- Experience with machine learning and deep learning frameworks (PyTorch, TensorFlow) and benchmarking tools (DeepSpeed, MLPerf)
- Experience with containerization technologies (Docker, Kubernetes)
- Experience working with the technologies that underpin Lambda's cloud business (GPU acceleration, virtualization, and cloud computing)
Benefits
- Generous cash & equity compensation
- Investors include Gradient Ventures, Google's AI-focused venture fund
- Extremely high demand for our systems, with quarter-over-quarter and year-over-year profitability
- Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG
- A wildly talented team of 200, and growing fast
- Health, dental, and vision coverage for you and your dependents
- Commuter/Work from home stipends
- 401(k) plan with 2% company match
- Flexible Paid Time Off Plan that we all actually use