Senior Site Reliability Engineer

Xero

πŸ“Remote - Australia

Summary

Join Xero's Incident and Problem Management team as an experienced SRE to build, deliver, and maintain robust incident management processes and tooling. You will drive enduring reliability through fast responses to high-severity incidents, build a world-class process, and lead technical discussions to identify and track actions during incidents. You will dig into incident causes, proactively examine potential future incidents, and work with engineering teams to remove risks. You will also build playbooks and automation for quick responses and provide ongoing training. In this role you will serve as a Technical Duty Officer (TDO), driving fast mitigation and resolution of impactful events. The position requires strong technical skills, SRE experience, and excellent communication abilities.

Requirements

  • Previous career experience as a Site Reliability Engineer, in an Operations or Engineering environment
  • Strong coding experience (preferably with Python)
  • Hands-on experience troubleshooting AWS hosted services
  • Networking knowledge and the ability to troubleshoot TCP/IP, SSL/TLS, DNSSEC, IPsec, and BGP issues
  • Strong oral and written communication skills, including the ability to translate technical issues and concepts into agreed actions

Responsibilities

  • Own the incident management process, ensuring it drives enduring reliability across all products and services within Xero
  • Provide expert leadership during critical outages, coordinating multiple teams to ensure streamlined decision-making and quick resolution
  • Lead and advocate for the transformation to a world-leading SRE organization, promoting SRE principles within the Engineering Department
  • Promote a customer-focused approach by addressing and mitigating global customer environment issues, and fostering a culture of continuous learning and technical excellence within the SRE team
  • Develop and implement scalable process frameworks and observability strategies to ensure rapid problem diagnosis, response, and service reliability
  • Collaborate with product teams to thoroughly analyze failures and integrate insights to improve service reliability, scalability, and operational efficiency

Benefits

  • Very generous paid leave to use however you’d like (plus statutory holidays!)
  • Dedicated paid leave to care for your physical and mental wellbeing as well as an Employee Assistance Program to access mental health care for you and your family
  • Health insurance, life insurance, and income protection
  • Wellbeing and sports programmes, and employee resource groups
  • 26 weeks of paid parental leave for primary caregivers
  • An Employee Share Plan
  • Beautiful offices
  • Flexible working
  • Career development
  • Many other benefits that reflect our human value
