Lambda is hiring an
HPC Operations Engineer in the United States

HPC Operations Engineer closed
🏢 Lambda
💵 $120k-$160k
📍United States
📅 Posted on Jun 10, 2024

Summary

The job is a remote position for an HPC/AI cluster deployment and configuration specialist at Lambda. The role involves deploying and configuring large-scale HPC clusters, troubleshooting issues, providing clear updates to project leads, staying updated on the latest HPC/AI technologies, and contributing to Standard Operating Procedures.

Requirements

  • Have a good understanding of HPC/AI architecture, operating systems, firmware, software, and networking
  • Have 3+ years of experience in deploying and configuring HPC clusters for AI workloads
  • Have an innate attention to detail
  • Be familiar with Bright Cluster Manager or similar cluster management tools

Responsibilities

  • Remotely deploy and configure large-scale HPC clusters for AI workloads (up to many thousands of nodes)
  • Remotely install and configure operating systems, firmware, software, and networking on HPC clusters both manually and using automation tools
  • Troubleshoot and resolve HPC cluster issues, working closely with on-site physical deployment teams
  • Provide context and details to an automation team to further automate the deployment process
  • Provide clear and detailed requirements back to the HPC design team on gaps and areas for improvement, specifically simplification, stability, and operational efficiency
  • Contribute to the creation and maintenance of Standard Operating Procedures
  • Provide regular and well-communicated updates to project leads throughout each deployment

Preferred Qualifications

  • Experience with machine learning and deep learning frameworks (PyTorch, TensorFlow) and benchmarking tools (DeepSpeed, MLPerf)
  • Experience with containerization technologies (Docker, Kubernetes)
  • Experience working with the technologies that underpin Lambda's cloud business (GPU acceleration, virtualization, and cloud computing)

Benefits

  • Generous cash & equity compensation
  • Investors include Gradient Ventures, Google’s AI-focused venture fund
  • Extremely high demand for our systems, with quarter-over-quarter and year-over-year profitability
  • Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG
  • A wildly talented team of 200, and growing fast
  • Health, dental, and vision coverage for you and your dependents
  • Commuter/Work from home stipends
  • 401(k) plan with 2% company match
  • Flexible Paid Time Off Plan that we all actually use
This job is filled or no longer available
