Manager, Site Reliability Engineering, Observability

Toast
Summary
Join Toast's Site Reliability Engineering team as a Manager of Observability Enablement & Administration! You will provide technical leadership and hands-on contributions, focusing on reliability best practices, observability, and incident resolution. The role owns the architecture, administration, and enhancement of Toast's observability platforms, keeping them performant and available. You will create and drive strategic observability initiatives, manage a geographically distributed team, and implement strategies to increase platform reliability. You will also guide teams in building observable systems, support end users with training, and gather and analyze metrics that give development teams observability insights. The position requires hands-on experience managing an SRE or Observability team and a deep understanding of observability systems and tools.
Requirements
- Hands-on experience managing an SRE or Observability team, including hiring, mentoring, and cross-functional collaboration
- Hands-on coding/scripting experience in languages such as Go or Python
- Deep understanding of observability systems and tools such as APM, RUM, Synthetics, Splunk, OpenTelemetry (OTel), log pipelines, SIEM, and Terraform
- Background in leading complex engineering projects in a Scrum environment
- Direct exposure to cloud infrastructure and SaaS solutions
- Polyglot technologist/generalist with a thirst for learning
Responsibilities
- Own the architecture, administration, maintenance, and enhancement of our observability platforms, ensuring optimal performance and availability for our critical security and business operations
- Create and drive strategic organization-wide observability initiatives in collaboration with technical leadership and Product Management
- Drive day-to-day operations of the team and contribute to the development and prioritization of the SRE roadmap for observability initiatives
- Enable a geographically distributed team of engineers to sustain high performance and increase the impact of their work
- Oversee observability architecture design, support, and platform management
- Implement strategies to increase observability platform reliability and performance
- Lead and contribute to initiatives that automate operational toil in observability-focused tasks, such as those required to meet legal and compliance requirements
- Guide teams to build and maintain systems that are observable
- Support end users with training and technical guidance on observability tools and capabilities
- Gather and analyze metrics from operating systems and applications to give development teams observability insights
Benefits
Competitive compensation and benefits programs