Lead Site Reliability Engineer

Xero
Summary
Join Xero's Incident and Problem Management team as a Lead Engineer and drive enduring reliability. You will own the incident management process, provide expert leadership during outages, and lead the transformation to a world-leading SRE organization. Responsibilities include developing scalable process frameworks, collaborating with product teams to analyze failures, and promoting SRE principles. The role requires a strong technical background, deep experience in SRE, and extensive experience leading technical responses to high-severity cloud issues. You'll be the backbone of a new team, providing a Technical Duty Officer (TDO) function. Xero offers generous paid leave, health insurance, life insurance, income protection, wellbeing programs, parental leave, and other benefits.
Requirements
- Previous experience as a Site Reliability Engineer in an Operations or Engineering environment
- Strong hands-on coding experience (preferably Python) and knowledge of software engineering best practice
- Hands-on experience troubleshooting AWS hosted services
- Networking knowledge and the ability to troubleshoot TCP/IP, SSL/TLS, DNSSEC, IPsec, and BGP issues
- Strong communication (oral & written) skills including the ability to translate technical issues/concepts into agreed actions
Responsibilities
- Own the incident management process, ensuring it drives enduring reliability across all products and services within Xero
- Provide expert leadership during critical outages, coordinating multiple teams to ensure streamlined decision-making and quick resolution
- Lead and advocate for the transformation to a world-leading SRE organization, promoting SRE principles within the Engineering Department
- Promote a customer-focused approach by addressing and mitigating global customer environment issues, and fostering a culture of continuous learning and technical excellence within the SRE team
- Develop and implement scalable process frameworks and observability strategies to ensure rapid problem diagnosis, response, and service reliability
- Collaborate with product teams to thoroughly analyze failures and integrate insights to improve service reliability, scalability, and operational efficiency
Benefits
- Offering very generous paid leave to use however you'd like (plus statutory holidays!)
- Dedicated paid leave to care for your physical and mental wellbeing as well as an Employee Assistance Program to access mental health care for you and your family
- Health insurance
- Life insurance
- Income protection
- Wellbeing and sports programmes, and employee resource groups
- 26 weeks of paid parental leave for primary caregivers
- An Employee Share Plan
- Beautiful offices
- Flexible working
- Career development
- And many other benefits that reflect our human value