Lambda is hiring a
Senior Kubernetes Operations Engineer

closed

Lambda

💵 $169k-$243k
📍 Remote - United States, Canada

Summary

This is a remote Operations Engineer/SRE/Sysadmin position at Lambda, focused on managing and maintaining bare-metal Kubernetes clusters. The role involves handling cluster issues, improving tooling, assisting customers, and working with other teams.

Requirements

  • Are an experienced operations engineer, SRE, sysadmin or similar with a deep knowledge of running Linux clusters and systems
  • Are very familiar with running on bare-metal (including knowledge of BMCs, kernel drivers, PXE, RAID, VLANs, hypervisors)
  • Have a good understanding of containers, virtualisation, and the mechanisms underpinning them
  • Have a good understanding of daily operation, bug-fixing and maintenance of Kubernetes
  • Have experience in an on-call environment and with incident response
  • Can perform incident post-mortems and develop procedures and tooling to prevent root causes from recurring
  • Have an excellent ability to learn on-the-fly and adapt to solve problems
  • Are able to work either independently with limited direction, or as part of a team
  • Are able to work with customers during incidents either via tickets, live messaging, or as part of a larger call

Responsibilities

  • Remotely install, upgrade, operate and maintain bare-metal Kubernetes clusters (up to thousands of nodes each)
  • Handle cluster degradation, recovery and resizing using our fleet management tooling
  • Perform out-of-hours on-call response for critical incidents as part of a well-balanced on-call rotation
  • Work on improving our tooling, automation, and processes across daily operations, alerting, and incident response
  • Dive into systems at a low level to solve unique cluster problems and write up your findings
  • Assist customers with high-level Kubernetes questions and integration with applications, storage and authentication
  • Assist with initial cluster build-outs and validation to help identify failed hardware before customer delivery
  • Work closely with our HPC Ops and Datacenter Ops teams on issues that require lower-level expertise or cross-functional solutions
  • Mentor and assist less-experienced team members
  • Have a voice in our product direction and help us think about how to minimize operational costs and complexity

Preferred Qualifications

  • Deep Kubernetes experience
  • Experience with user-level restrictions and hardening (e.g. AppArmor)
  • Experience with network engineering
  • Experience with HPC clusters, environments & tooling
  • Experience with large-scale AI/ML training clusters
  • Experience with machine learning/AI frameworks

Benefits

  • Generous cash & equity compensation
  • Investors include Gradient Ventures, Google's AI-focused venture fund
  • We are experiencing extremely high demand for our systems, with quarter-over-quarter and year-over-year profitability
  • Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG
  • We have a wildly talented team of 200, and growing fast
  • Health, dental, and vision coverage for you and your dependents
  • Commuter/Work from home stipends
  • 401k Plan with 2% company match
  • Flexible Paid Time Off Plan that we all actually use
This job is filled or no longer available
