Lead Site Reliability Engineer, Senior Site Reliability Engineer
Xero
Summary
Join Xero's Incident and Problem Management team as a Lead or Senior Engineer and be a part of the Site Reliability Engineering (SRE) organization. You will own the incident management process, provide expert leadership during outages, and lead the transformation to a world-leading SRE organization. Responsibilities include developing scalable process frameworks, collaborating with product teams to analyze failures, and promoting SRE principles. This role requires experience as an SRE, hands-on AWS troubleshooting, networking knowledge, coding experience (preferably Python), and strong communication skills. Xero offers generous paid leave, wellbeing programs, medical insurance, parental leave, an employee share plan, flexible working, and career development opportunities.
Requirements
- Previous career experience as a Site Reliability Engineer, in an Operations or Engineering environment
- Hands-on experience troubleshooting AWS hosted services
- Networking knowledge and the ability to troubleshoot TCP/IP, SSL/TLS, DNSSEC, IPsec, and BGP issues
- Coding experience (preferably Python) building tools, scripting, or automation
- Strong oral and written communication skills, including the ability to translate technical issues and concepts into agreed actions
Responsibilities
- Own the incident management process, ensuring it drives enduring reliability across all products and services within Xero
- Provide expert leadership during critical outages, coordinating multiple teams to ensure streamlined decision-making and quick resolution
- Lead and advocate for the transformation to a world-leading SRE organization, promoting SRE principles within the Engineering Department
- Promote a customer-focused approach by addressing and mitigating global customer environment issues, and fostering a culture of continuous learning and technical excellence within the SRE team
- Develop and implement scalable process frameworks and observability strategies to ensure rapid problem diagnosis, response, and service reliability
- Collaborate with product teams to thoroughly analyze failures and integrate insights to improve service reliability, scalability, and operational efficiency
Benefits
- Very generous paid leave to use however you'd like (plus statutory holidays!)
- Dedicated paid leave to care for your physical and mental wellbeing
- An Employee Assistance Program to access mental health care for you and your family
- Free medical insurance
- Wellbeing and sports programmes
- Employee resource groups
- 26 weeks of paid parental leave for primary caregivers
- An Employee Share Plan
- Beautiful offices
- Flexible working
- Career development