Global Incident Management

Stripe
Summary
Join Stripe's Incident Ops team as the Head of Incident Response, leading a global 24/7 team responsible for driving incident response and management. You will lead and optimize incident management processes and automation, ensuring efficiency and adherence to stringent metrics. You will also establish and maintain a best-in-class incident response framework that upholds Stripe's reliability standards. Responsibilities include incident classification, escalation, notification management, and accountability for key metrics. You will generate actionable insights, collaborate with engineering leadership, and manage incident communications across multiple channels. This role requires leading and developing a highly effective team that operates with urgency and collaborates cross-functionally.
Requirements
- 10+ years of management experience, including 4+ years managing managers, with a proven record of building, growing and transforming teams
- Extensive experience (8+ years) leading incident response for complex, large-scale distributed services with high SLOs/SLAs, coupled with deep expertise in crisis management
- Demonstrated ability to lead, influence other leaders and deliver complex strategic projects involving multiple stakeholders
- Strong analytical skills and the ability to use data to drive business decisions
- Proficiency in basic incident troubleshooting and a reasonable understanding of system architecture. Fluency in querying data with SQL, Splunk, or similar tools
- Exceptional communication abilities, capable of adapting incident updates for diverse audiences (executives, external users, internal teams)
- Affinity for a fast-paced work environment, crafting strategic and rapid fixes to high-intensity problems with a keen eye for detail and a high bar for quality
- Comfort navigating ambiguity, while identifying areas for process improvement and establishing best practices
Responsibilities
- Lead the global 24/7 team of regional managers and incident response managers, with the ability to be hands-on and support frontline on-call through speed, cross-functional collaboration and escalation
- Develop and own Stripe's incident response and management strategy and cross-functional roadmap, ensuring it aligns with the company's reputation for reliability
- Spearhead and manage Stripe's AI-First strategy for automation of incident response workflows, partnering with the engineering team to implement required tooling enhancements
- Enhance Stripe's incident response by leading and implementing improvements derived from analyzing user-facing incidents and extracting actionable insights and learnings
- Collaborate closely with executive leadership, engineering, and operations teams to lead significant programs and reshape workflows and metrics concerning reliability and incident operations
- Manage relevant TTx (time-to-X) metrics, particularly those related to communication and escalation, and collaborate with engineering leadership to implement necessary improvements for each metric
- Develop user-focused metrics and data to guide Stripe's incident response, reliability strategy, and user communications (including RCAs), ensuring impactful decision-making
Preferred Qualifications
- Experience managing geographically dispersed teams
- Experience using infrastructure and application monitoring tools such as Prometheus, Sentry and others
- Experience in incident response at a high-growth technology company, preferably within the payments or e-commerce sectors
- Proven ability to apply agentic and generative AI to revolutionize incident response, coupled with a strong grasp of current industry trends in the incident response domain
- Demonstrated history of driving engineering and process enhancements to improve incident response efficiency within a rapidly expanding technology organization