Remote Senior HPC Systems Engineer
Lambda
$180k-$250k
Remote - United States, Canada
Job highlights
Summary
The role is for a Deep Learning Cloud Architect at Lambda. The engineer will design and architect AI supercomputers, improve the performance of HPC storage and networking infrastructure, work closely with the ML team, set up monitoring, logging, and alerting, and provide guidance to HPC customers. The ideal candidate has expertise in large-scale HPC network and storage infrastructure, experience building complex software in Python, a deep understanding of Linux fundamentals, and experience with large GPU clusters, virtualization, and Kubernetes. The company offers competitive compensation, health benefits, commuter stipends, a 401k plan, and flexible paid time off.
Requirements
- Expertise with architecting, operating, and debugging large-scale HPC network and storage infrastructure, ideally using MPI, NCCL, RDMA, InfiniBand, and parallel file systems
- Experience building complex, high-quality software using Python
- Deep understanding of Linux fundamentals, especially its networking stack
Responsibilities
- Design and architect the state-of-the-art AI supercomputers powering our cloud
- Introduce technology and software to improve the performance, resiliency, and quality of service of our HPC storage and networking infrastructure
- Work closely with our ML team to benchmark, tune, and optimize our hypervisors, network, and storage
- Set up monitoring, logging and alerting to ensure high availability and observability
- Provide guidance and represent the interests of our HPC customers
Preferred Qualifications
- Experience with large GPU clusters is strongly preferred
- Experience with virtualization and Kubernetes
Benefits
- Generous cash & equity compensation
- Investors include Gradient Ventures, Google's AI-focused venture fund
- We are experiencing extremely high demand for our systems, with quarter-over-quarter and year-over-year profitability
- Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG
- We have a wildly talented team of 200, and growing fast
- Health, dental, and vision coverage for you and your dependents
- Commuter/Work from home stipends
- 401k Plan with 2% company match
- Flexible Paid Time Off Plan that we all actually use
This job is filled or no longer available