Lead Site Reliability Engineer
Xero
Summary
Join Xero's Product SRE team as a Lead Engineer and become the most senior technical resource, empowering teams to own and drive reliability across the product landscape. You will contribute to Xero's Product SRE strategy and the transformation of its SRE culture. As an expert communicator, you will manage change and ensure the value of robust systems is clearly communicated. This role requires a highly technical individual with a strong engineering background and deep experience in SRE. You will provide technical leadership, build relationships with product engineering teams, champion observability best practices, and build a culture of continuous improvement. The ideal candidate will have a proven track record in technical leadership, a strong engineering and hands-on SRE background, and a passion for delivering a high-quality and highly stable customer experience.
Requirements
- Proven track record in technical leadership roles, with the ability to inspire and empower cross-functional teams to achieve operational excellence and drive continuous improvement
- Extremely technical skillset, with a strong engineering and hands-on SRE background. Demonstrable experience as the technical authority in a highly technical team
- Deep and proven experience in providing technical leadership and mentoring in world-class embedded SRE teams in a fast-growing company
- Obsessed with delivering a high-quality and highly stable customer experience. Passion for customer-first thinking, with a strong product mindset that helps you understand and anticipate customer needs
- Experience building and delivering an error budget culture that addresses consistent SLA/SLO breaches, coupled with a 24/7 focus on incident response and remediation
- Broad and deep technical understanding of modern cloud technologies (AWS, Azure, GCP) and their incident and problem management practices, particularly for high-growth, high-availability SaaS-based transactional systems
- Proficiency in one or more object-oriented programming languages (C#, JavaScript, Java, Python, etc.) or experience with infrastructure-as-code (e.g. Terraform, CloudFormation)
- Experience using observability tooling to monitor the health of a highly distributed system
Responsibilities
- Provide technical leadership to ensure completion of the day-to-day deliverables of a dedicated product SRE team
- Build long-term relationships with product engineering teams, ensuring everyone can deliver on system reliability with a theme of continuous improvement
- Champion observability best practice and drive its implementation across products so that impactful events are detected quickly
- Build a culture of continuous improvement so that product reliability keeps improving and the impact of issues is reduced; create and actively monitor quality standards for SRE teams and report regularly on adherence to them
- Build and deliver an error budget culture that addresses consistent SLA/SLO breaches
- Provide ongoing training across the business to ensure reliability requirements are well understood and incorporated into product designs
Preferred Qualifications
- Any experience with reliability concepts such as: capacity management, autoscaling, safe deployment and releases, software strategies for reliability, fault tolerance, and graceful failure
- Understanding of human factors, safety science, and resilience engineering
Benefits
- Very generous paid leave to use however you'd like (plus statutory holidays!)
- Dedicated paid leave to care for your physical and mental wellbeing
- An Employee Assistance Program to access mental health care for you and your family
- Free medical insurance
- Wellbeing and sports programmes
- Employee resource groups
- 26 weeks of paid parental leave for primary caregivers
- An Employee Share Plan
- Beautiful offices
- Flexible working
- Career development