Lambda is hiring an HPC Operations Engineer in the United States

HPC Operations Engineer
🏢 Lambda
💵 $120k-$160k
📍 United States
📅 Posted on Jun 10, 2024

Summary

The job is a remote position for an HPC/AI cluster deployment and configuration specialist at Lambda. The role involves deploying and configuring large-scale HPC clusters, troubleshooting issues, providing clear updates to project leads, staying updated on the latest HPC/AI technologies, and contributing to Standard Operating Procedures.

Requirements

  • Have a good understanding of HPC/AI architecture, operating systems, firmware, software, and networking
  • Have 3+ years of experience in deploying and configuring HPC clusters for AI workloads
  • Have an innate attention to detail
  • Be familiar with Bright Cluster Manager or similar cluster management tools

Responsibilities

  • Remotely deploy and configure large-scale HPC clusters for AI workloads (up to many thousands of nodes)
  • Remotely install and configure operating systems, firmware, software, and networking on HPC clusters both manually and using automation tools
  • Troubleshoot and resolve HPC cluster issues, working closely with on-site physical deployment teams
  • Provide context and details to an automation team to further automate the deployment process
  • Provide clear and detailed requirements back to the HPC design team on gaps and improvement areas, specifically simplification, stability, and operational efficiency
  • Contribute to the creation and maintenance of Standard Operating Procedures
  • Provide regular and well-communicated updates to project leads throughout each deployment

Preferred Qualifications

  • Experience with machine learning and deep learning frameworks (PyTorch, TensorFlow) and benchmarking tools (DeepSpeed, MLPerf)
  • Experience with containerization technologies (Docker, Kubernetes)
  • Experience working with the technologies that underpin Lambda's cloud business (GPU acceleration, virtualization, and cloud computing)

Benefits

  • Generous cash & equity compensation
  • Investors include Gradient Ventures, Google’s AI-focused venture fund
  • Extremely high demand for our systems, with quarter-over-quarter and year-over-year profitability
  • Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG
  • A wildly talented team of 200, and growing fast
  • Health, dental, and vision coverage for you and your dependents
  • Commuter/Work from home stipends
  • 401(k) plan with 2% company match
  • Flexible Paid Time Off Plan that we all actually use