Incident Response Manager

Stripe

📍 Remote - Worldwide

Summary

Join Stripe's Incident Ops team as an Incident Response Manager (IRM) and play a key role in driving incident response and management. You will lead incident resolution, collaborate with cross-functional teams, and ensure timely communication with users. Responsibilities include acting as Incident Commander, leading user-facing incidents, and contributing to root cause analysis. The ideal candidate possesses 5+ years of major incident experience, strong technical skills, and excellent communication abilities. Preferred qualifications include domain expertise in various incident classes and experience with user-facing communications. This is a 24/7 role requiring strong problem-solving and decision-making skills in high-pressure situations.

Requirements

  • 5+ years of demonstrable major incident experience at organizations that run mission-critical applications or always-on SaaS environments
  • Demonstrated ability to lead multiple incidents concurrently with authority, and to influence responders, using agency and reasoning skills to resolve ambiguous problems and drive to root cause
  • Strong full-stack technical skills, with development or support experience in cloud-based technologies
  • Demonstrated experience developing code and automation using Python, Ruby, JavaScript or shell scripting
  • Solid understanding of infrastructure, including physical, virtual, and container-based compute platforms
  • Strong quantitative and analytical skills in data manipulation using SQL, Splunk, or other tools
  • Excellent task management skills; must be detail-oriented, with the ability to remain composed, methodical, and quick-thinking in a high-pressure environment
  • Exceptional written and verbal English communication skills, with the ability to translate complex technical issues for internal and external stakeholders

Responsibilities

  • Act as an on-call Incident Commander, responsible for driving and managing incident resolution with a high level of urgency, cross-functional collaboration, and accuracy, while partnering with a global and diverse set of teams, including Engineering, Product, Policy, Risks, PR, Legal, Execs, etc.
  • Lead all user-facing incidents across domains at Stripe - including reliability, technical, security, and data privacy
  • Take a "User First" approach to determining impact, providing accurate situation reports, facilitating comms bridges, and ensuring useful and timely external communications to users
  • Proactively update internal stakeholders, make decisions through data and influence by partnering with Engineering, Sales, Support and other cross-functional teams
  • Contribute to the root cause analysis process while conducting post-mortems, identifying remediations, and ensuring problem management tasks meet SLA and user expectations
  • Drive improvements in the incident handling process and incident management metrics and tooling based on trends and data of Stripe's incidents in collaboration with engineering, product and operations teams
  • Collaborate closely with leadership for building team strategy based on the team vision
  • Collaborate and coach other Incident Response Managers on the team

Preferred Qualifications

  • Domain expertise in classes of incidents such as technical, privacy, security or crisis with a strong desire to continuously learn about Stripe's products, technical issues and systems
  • Ability to review complex technical details regarding ongoing issues/events and convey the key details to senior stakeholders to facilitate real-time decision making
  • Experience with broad user-facing communications (e.g. status pages, tweets) and/or targeted communications (e.g. direct emails, support ticket responses)
  • Familiarity with operating or managing distributed architectures, with the ability to correlate system behaviors based on known interdependencies
  • Demonstrated experience with full-stack development and support
This job is filled or no longer available