Senior Staff Software Engineer, Reliability Engineering

Airbnb
Summary
Join Airbnb as a Senior Staff Site Reliability Engineer (SRE) and play a key role in developing and implementing a best-in-class, enterprise-wide SRE program. You will drive the long-term reliability strategy, ensuring the performance and reliability of Airbnb's infrastructure and products, and collaborate with engineering teams to provide the tools and expertise they need to run reliable services. As a senior technical contributor, you will solve broad technical challenges, lend expertise to specific teams, and contribute code and/or participate in architecture and design. The role involves developing roadmaps, designing SRE architecture, creating incident management processes, fostering the SRE model, and bringing a customer focus to reliability. You will also build partnerships, learn from incidents, mentor other SREs, and create a culture of reliability.
Requirements
- BS, MS, or PhD in computer science, related field, or equivalent work experience
- 12+ years of software engineering experience, with a significant portion dedicated to system architecture and design in consumer-facing technology companies
- Strong leadership skills, with 5+ years of experience as a senior-level technical lead or architect, driving the technical direction and strategy across multiple teams or projects
- Excellent communication and collaboration skills, with a proven track record of working effectively across teams and organizations
- Demonstrated expertise in building and scaling high-availability systems and platforms, with a deep understanding of multi-cloud environments
Responsibilities
- Develop a roadmap with a longer-term vision for Reliability and serve as a strategic thought partner within the organization
- Design, implement, and influence company-wide SRE architecture, innovation, engineering, and standards
- Create incident management processes that can scale with the organization as it continues its rapid growth. Assess how the organization manages incidents and responds to them; reduce operational toil stemming from incident management
- Foster an SRE/Reliability model that accounts for the nuances of an engineering culture with a strong sense of ownership over its services
- Bring a strong customer focus to the Reliability function, centered on optimizing the infrastructure and platform, and ensuring systems are highly available and performant
- Develop Production Readiness standards to ensure service reliability. Automate as much as possible and always configure as code. Predict future failures and work proactively to mitigate them. Advocate for and implement reliable design patterns (circuit breakers, graceful degradation, etc.)
- Create a culture where Reliability is a state of mind, instilling a proactive approach to seeing patterns and opportunities to increase leverage and tooling
- Build deep partnerships with engineering leaders. Work closely with product engineering teams on design and implementation choices of large-scale distributed systems
- Partner with the broader organization to learn from incidents through a blameless postmortem process
- Mentor and lead other Site Reliability Engineers. Uplevel and support others with servant leadership, mentorship, advocacy, and allyship
Benefits
- Bonus
- Equity
- Benefits
- Employee Travel Credits