Summary
Join Airbnb's Site Reliability Engineering team as a Staff Software Engineer! You will design, implement, and maintain the tools and systems that support service reliability, and collaborate with engineering teams to keep services reliable at scale. Your expertise will be crucial in improving incident management and overall operational efficiency. You'll be a key member of the first responder SRE team, leading incident management and mentoring others. This role requires strong technical skills, excellent communication, and a commitment to continuous learning. The position is US-remote eligible, with occasional office work. Compensation includes a competitive salary, bonus, equity, benefits, and employee travel credits.
Requirements
- Bachelor's degree in Computer Science or related field
- 9+ years of experience in software engineering or SRE roles, with a focus on large-scale distributed systems
- Strong coding skills in at least one programming language, such as Java, Python, or Go
- Experience with distributed systems and service-oriented architectures
- Experience with cloud computing platforms such as AWS or Google Cloud Platform
- Strong conviction in software development best practices, including version control, automated testing, and continuous integration and delivery
- Experience with containerization technologies such as Docker and Kubernetes
- Excellent problem-solving and analytical skills, with strong attention to detail
- Ability to work effectively in a fast-paced and dynamic environment
- Strong communication and interpersonal skills
Responsibilities
- Design, implement, and maintain the tools and systems that support service reliability, monitoring, and alerting
- Collaborate with other engineering teams to ensure services are designed with reliability in mind, and provide guidance on the appropriate use of tooling and automation
- Identify opportunities to improve the reliability, scalability, and efficiency of our services and drive their implementation
- Work with SREs to understand the challenges they face in operating our services and develop tools and systems to help them manage these challenges
- Participate in incident response and post-mortems to identify and address systemic issues
- Continuously evaluate new technologies and industry best practices to improve our SRE tooling and incident response procedures
- Gain and maintain an intimate understanding of how the critical parts of the site work (services, infrastructure, tooling, and processes)
- Lead high-urgency incident management and mentor less-experienced team members in effectively handling incidents
- Contribute to better incident retrospectives, driving improvements in our overall reliability and incident response time
Benefits
- Bonus
- Equity
- Benefits
- Employee Travel Credits