Lambda is hiring a
Senior Kubernetes Operations Engineer
$169k-$243k
Remote - United States, Canada
Summary
This is a remote Operations Engineer/SRE/Sysadmin role at Lambda, focused on managing and maintaining bare-metal Kubernetes clusters. The role involves handling cluster issues, improving tooling, assisting customers, and working with other teams.
Requirements
- Are an experienced operations engineer, SRE, sysadmin or similar with a deep knowledge of running Linux clusters and systems
- Are very familiar with running on bare-metal (including knowledge of BMCs, kernel drivers, PXE, RAID, VLANs, hypervisors)
- Have a good understanding of containers, virtualisation, and the mechanisms underpinning them
- Have a good understanding of daily operation, bug-fixing and maintenance of Kubernetes
- Have experience in an on-call environment and with incident response
- Can perform incident post-mortems and develop procedures and tooling to prevent root causes from recurring
- Have an excellent ability to learn on-the-fly and adapt to solve problems
- Are able to work either independently with limited direction, or as part of a team
- Are able to work with customers during incidents either via tickets, live messaging, or as part of a larger call
Responsibilities
- Remotely install, upgrade, operate and maintain bare-metal Kubernetes clusters (up to thousands of nodes each)
- Handle cluster degradation, recovery and resizing using our fleet management tooling
- Perform out-of-hours on-call response for critical incidents as part of a well-balanced on-call rotation
- Work on improving our tooling, automation, and processes for daily operations, alerting, and incident response
- Dive into systems at a low level to solve unique cluster problems and write up your findings
- Assist customers with high-level Kubernetes questions and integration with applications, storage and authentication
- Assist with initial cluster build-outs and validation to help identify failed hardware before customer delivery
- Work closely with our HPC Ops and Datacenter Ops teams on issues that require lower-level expertise or cross-functional solutions
- Mentor and assist less-experienced team members
- Have a voice in our product direction and help us think about how to minimize operational costs and complexity
Preferred Qualifications
- Deep Kubernetes experience
- Experience with user-level restrictions and hardening (e.g. AppArmor)
- Experience with network engineering
- Experience with HPC clusters, environments & tooling
- Experience with large-scale AI/ML training clusters
- Experience with machine learning/AI frameworks
Benefits
- Generous cash & equity compensation
- Investors include Gradient Ventures, Google's AI-focused venture fund
- We are experiencing extremely high demand for our systems, with quarter-over-quarter and year-over-year profitability
- Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG
- We have a wildly talented team of 200, and growing fast
- Health, dental, and vision coverage for you and your dependents
- Commuter/Work from home stipends
- 401k Plan with 2% company match
- Flexible Paid Time Off Plan that we all actually use
This job is filled or no longer available