Senior Incident Manager

Trustly

📍 Remote - Brazil

Summary

Join Trustly as an individual contributor focused on continuous monitoring, helping keep infrastructure and services reliable, scalable, and efficient. Guide a team of SREs, providing technical guidance and mentorship. Lead incident response, identify root causes, and implement preventative measures. Optimize system and application performance, drive automation initiatives, and generate regular reports on system reliability. Collaborate with cross-functional teams to define KPIs and develop reporting frameworks. Conduct capacity planning, prepare executive-level reports, and present findings and recommendations to management. This role requires a B.S. in Computer Science or a related field, experience leading DevOps or SRE teams, and expertise in scripting languages and cloud technologies. Advanced English is required.

Requirements

  • B.S. in Computer Science or a related field
  • Experience leading DevOps or SRE teams
  • IT project management experience
  • Expertise with scripting languages such as shell script and Python
  • Experience supporting critical production services in the cloud (AWS) and on-premises
  • Experience with network technologies and with system, security, and network monitoring tools
  • Technical knowledge of databases and the Linux operating system, including standards and best practices for keeping services up and running
  • Proactive approach to spotting problems and areas for improvement, removing manual processes and toil through code, and fixing performance issues through code
  • Experience building SLAs, SLIs, SLOs, and error budgets based on business rules
  • Advanced English
  • Strong communication and negotiation abilities
  • Strong interpersonal skills
  • Sense of ownership and a big-picture view of company structures
  • Ability to work under pressure
  • Ability to understand business language and to mediate communication between technical and business teams
  • Clear understanding of urgency and priority
  • Ability to analyze critically, synthesize information, and solve problems

Responsibilities

  • Guide a team of SREs: Provide technical guidance, mentorship, and support to a team of SRE engineers. Foster a collaborative and inclusive team environment. Define the roadmap according to business requirements and collaborate with the team to ensure that execution priorities stay aligned with those requirements
  • Incident management: Lead incident response efforts, identify root causes, and implement preventative measures to minimize future incidents
  • Performance optimization: Identify performance bottlenecks, conduct performance analysis, and optimize system and application performance
  • Automation and tooling: Drive automation initiatives, develop and maintain tools, scripts, and frameworks to streamline deployment, monitoring, and troubleshooting processes
  • Generate regular reports on system reliability, uptime, and performance metrics to provide insights and visibility to stakeholders
  • Collaborate with cross-functional teams to define key performance indicators (KPIs) and develop reporting frameworks to track and monitor system health and operational efficiency
  • Conduct capacity planning for infrastructure based on environment metrics and expected growth
  • Prepare executive-level reports summarizing incidents, their resolutions, and recommendations for improvements
  • Present findings, trends, and recommendations to management and stakeholders, providing actionable insights to drive decision-making processes

Benefits

  • Bradesco health and dental plan, for you and your dependents, with no co-payment cost
  • Life insurance with differentiated coverage
  • Meal voucher and supermarket voucher
  • Home Office Allowance
  • Wellhub - Platform that gives access to spaces for physical activities and online classes
  • Trustly Club - Discount at educational institutions and partner stores
  • Monthly happy hours with iFood coupon
  • English Program - Online group classes with a private teacher
  • Extended maternity and paternity leave
  • Birthday Off
  • Flexible hours/Home Office - our culture is remote-first! You can work from any city in Brazil
  • Welcome Kit - We work with Apple equipment (MacBook Pro, iPhone) and we send many more treats! Spoiler alert: you can purchase the equipment, subject to internal criteria!
  • Annual bonus - As a member of our team, you are eligible to receive an annual bonus, at the company's discretion, based on the achievement of our KPIs and individual performance
  • Referral Program - If you refer a candidate and we hire the person, you will receive a reward for that!
