Senior HPC Operations Engineer at Lambda

Summary

The job is for a remote HPC Cluster Manager responsible for managing large-scale AI workloads, troubleshooting issues in HPC clusters, and mentoring team members. The company offers competitive compensation, health insurance, retirement benefits, paid time off, flexible hours, commuting stipends, 401k plan, and wellness programs.

Requirements

Have 10+ years of experience in managing HPC clusters
Have 10+ years of everyday Linux experience
Have a strong understanding of HPC architecture (compute, networking, storage)
Have an innate attention to detail
Have experience with Bright Cluster Manager or similar cluster management tools
Are an expert in configuring and troubleshooting: SFP+ fiber, InfiniBand (IB), and 100 GbE network fabrics, Ethernet, switching, power infrastructure, GPU direct, RDMA, NCCL, Horovod environments, Linux-based compute nodes, firmware updates, driver installation, SLURM, Kubernetes, or other job scheduling systems

Responsibilities

Remotely provision and manage large-scale HPC clusters for AI workloads (up to many thousands of nodes)
Remotely install and configure operating systems, firmware, software, and networking on HPC clusters both manually and using automation tools
Troubleshoot and resolve HPC cluster issues working closely with physical deployment teams on-site
Provide context and details to an automation team to further automate the deployment process
Provide clear and detailed requirements back to HPC design team on gaps and improvement areas, specifically in the areas of simplification, stability, and operational efficiency
Contribute to the creation and maintenance of Standard Operating Procedures
Provide regular and well-communicated updates to project leads throughout each deployment
Mentor and assist less-experienced team members

Preferred Qualifications

Experience with machine learning and deep learning frameworks (PyTorch, TensorFlow) and benchmarking tools (DeepSpeed, MLPerf)
Experience with containerization technologies (Docker, Kubernetes)
Experience working with the technologies that underpin our cloud business (GPU acceleration, virtualization, and cloud computing)

Benefits

Generous cash & equity compensation
Investors include Gradient Ventures, Google’s AI-focused venture fund
We are experiencing extremely high demand for our systems, with quarter over quarter, year over year profitability
Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG
Health, dental, and vision coverage for you and your dependents
Commuter/Work from home stipends
401k Plan with 2% company match
Flexible Paid Time Off Plan that we all actually use

Senior HPC Operations Engineer

Lambda

Summary

Requirements

Responsibilities

Preferred Qualifications

Benefits

Remote

DevOps

Senior

Similar Remote Jobs

Remote

DevOps

Senior

Remote

Software Development

Senior

Remote

Data

Principal

ThisWay Global

Remote

Sales

Executive

Canonical

Remote

Product

Mid-level

Remote

Software Development

Senior