Lambda is hiring a Senior HPC Operations Engineer

Lambda

💵 $170k-$230k
📍Remote - United States

Summary

This is a remote Senior HPC Operations Engineer role responsible for provisioning and managing large-scale HPC clusters for AI workloads, troubleshooting cluster issues, and mentoring team members. The company offers cash and equity compensation, health, dental, and vision coverage, a 401k plan with a 2% company match, flexible paid time off, and commuter/work-from-home stipends.

Requirements

  • Have 10+ years of experience in managing HPC clusters
  • Have 10+ years of everyday Linux experience
  • Have a strong understanding of HPC architecture (compute, networking, storage)
  • Have an innate attention to detail
  • Have experience with Bright Cluster Manager or similar cluster management tools
  • Are an expert in configuring and troubleshooting SFP+ fiber, InfiniBand (IB), and 100 GbE network fabrics; Ethernet; switching; power infrastructure; GPUDirect, RDMA, NCCL, and Horovod environments; Linux-based compute nodes; firmware updates; driver installation; and SLURM, Kubernetes, or other job scheduling systems

Responsibilities

  • Remotely provision and manage large-scale HPC clusters for AI workloads (up to many thousands of nodes)
  • Remotely install and configure operating systems, firmware, software, and networking on HPC clusters both manually and using automation tools
  • Troubleshoot and resolve HPC cluster issues working closely with physical deployment teams on-site
  • Provide context and details to an automation team to further automate the deployment process
  • Provide clear and detailed requirements back to the HPC design team on gaps and improvement areas, specifically in the areas of simplification, stability, and operational efficiency
  • Contribute to the creation and maintenance of Standard Operating Procedures
  • Provide regular and well-communicated updates to project leads throughout each deployment
  • Mentor and assist less-experienced team members

Preferred Qualifications

  • Experience with machine learning and deep learning frameworks (PyTorch, TensorFlow) and benchmarking tools (DeepSpeed, MLPerf)
  • Experience with containerization technologies (Docker, Kubernetes)
  • Experience working with the technologies that underpin our cloud business (GPU acceleration, virtualization, and cloud computing)

Benefits

  • Generous cash & equity compensation
  • Investors include Gradient Ventures, Google’s AI-focused venture fund
  • We are experiencing extremely high demand for our systems, with quarter-over-quarter, year-over-year profitability
  • Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG
  • Health, dental, and vision coverage for you and your dependents
  • Commuter/Work from home stipends
  • 401k Plan with 2% company match
  • Flexible Paid Time Off Plan that we all actually use
This job is filled or no longer available
