Lead Site Reliability Engineer

Xero

πŸ“Remote - Australia

Summary

Join Xero's Incident and Problem Management team as a Lead Engineer and drive enduring reliability. You will own the incident management process, provide expert leadership during outages, and lead the transformation to a world-leading SRE organization. Responsibilities include developing scalable process frameworks, collaborating with product teams to analyze failures, and promoting SRE principles. The role requires a strong technical background, deep experience in SRE, and extensive experience leading technical responses to high-severity cloud issues. You'll be the backbone of a new team, providing a Technical Duty Officer (TDO) function. Xero offers generous paid leave, health insurance, life insurance, income protection, wellbeing programs, parental leave, and other benefits.

Requirements

  • Previous career experience as a Site Reliability Engineer, in an Operations or Engineering environment
  • Strong hands-on coding experience (preferably Python) and knowledge of software engineering best practice
  • Hands-on experience troubleshooting AWS hosted services
  • Networking knowledge and the ability to troubleshoot TCP/IP, SSL/TLS, DNSSEC, IPsec, and BGP issues
  • Strong oral and written communication skills, including the ability to translate technical issues and concepts into agreed actions

Responsibilities

  • Own the incident management process, ensuring it drives enduring reliability across all products and services within Xero
  • Provide expert leadership during critical outages, coordinating multiple teams to ensure streamlined decision-making and quick resolution
  • Lead and advocate for the transformation to a world-leading SRE organization, promoting SRE principles within the Engineering Department
  • Promote a customer-focused approach by addressing and mitigating global customer environment issues, and fostering a culture of continuous learning and technical excellence within the SRE team
  • Develop and implement scalable process frameworks and observability strategies to ensure rapid problem diagnosis, response, and service reliability
  • Collaborate with product teams to thoroughly analyze failures and integrate insights to improve service reliability, scalability, and operational efficiency

Benefits

  • Very generous paid leave to use however you’d like (plus statutory holidays)
  • Dedicated paid leave to care for your physical and mental wellbeing as well as an Employee Assistance Program to access mental health care for you and your family
  • Health insurance
  • Life insurance
  • Income protection
  • Wellbeing and sports programmes, and employee resource groups
  • 26 weeks of paid parental leave for primary caregivers
  • An Employee Share Plan
  • Beautiful offices
  • Flexible working
  • Career development
  • Many other benefits that reflect our human value
