Senior Service Reliability Engineer

Thoughtworks

📍 Remote - Chile

Summary

Join Thoughtworks as a Service Reliability Engineer (SRE GCP) and ensure technical excellence and operational efficiency within the infrastructure domain. Specializing in reliability, resilience, and system performance, you will champion Site Reliability Engineering principles. You will integrate automation, monitoring, and incident response in support of a customer-focused, agile approach. You will emphasize shared responsibility and continuous improvement while cultivating a collaborative culture to exceed reliability and business objectives. You will improve site reliability by building mechanisms and architectures that enable fault tolerance and faster response times. You will also work closely with application development teams, advising on system reliability improvements.

Requirements

  • Have hands-on experience in programming and scripting languages such as Python, Go or Bash
  • Have a good understanding of Google Cloud Platform (GCP)
  • Have had exposure to observability tools such as Grafana, Datadog, New Relic, ELK Stack, Dynatrace or equivalent, and be proficient in using data from these tools to dissect and identify root causes of system and infrastructure issues
  • Be familiar with DevOps and GitOps practices
  • Have good knowledge of container-based architecture and orchestration tools such as Kubernetes, AWS EKS, Docker Swarm, Nomad, etc.
  • Understand technical architecture and modern design patterns, including microservices, serverless functions, NoSQL and RESTful APIs, with experience in fixing bugs, analyzing logs, building metrics and operational dashboards
  • Be familiar with creating infrastructure resources that improve system reliability and follow the cloud Well-Architected Framework principles: reliability, security, cost optimization, performance efficiency and operational excellence
  • Have strong communication and articulation skills, and be proficient in English
  • Have good people skills with an emphasis on negotiation and close collaboration with multiple cross-functional teams from the client side and/or Thoughtworks
  • Solve challenging problems and difficult-to-debug issues with a never-give-up attitude
  • Have the ability to work under pressure and with composure during production incidents
  • Confidently recommend improvements backed by strong technical arguments to client stakeholders or application development teams
  • Be able to understand requirements provided by the client on both technical and business aspects and break them down for successful implementation
  • Have a strong drive and ownership mentality, with a willingness to sign up for and deliver work when called upon, without being too concerned about role boundaries
  • Be willing to join a team that provides 24x7 availability on a rotation and as-needed basis

Responsibilities

  • Improve site reliability by building mechanisms and architectures that enable fault tolerance and shorten the median time to detect and median time to respond
  • Drive the integration of observability automation into the CI/CD pipeline
  • Handle production incidents, manage incident communication with clients and draft root cause analysis documents
  • Monitor performance of production systems and improve their scaling to ensure business goals are met within expected SLA and SLO metrics
  • Work closely with application development teams, advising on system reliability and assisting in implementing reliability improvements
  • Improve system observability across multiple facets such as logging and metrics, reduce false alarms to eliminate unnecessary toil, and improve process efficiency
  • Implement chaos engineering practices as necessary to test system reliability, setting up processes for such testing to be done regularly
  • Have a clear understanding of client goals and business needs, and set the direction for site reliability accordingly, e.g. achieving application availability with minimal or no disruption (99.999%) where the business requires it

Benefits

  • There is no one-size-fits-all career path at Thoughtworks: however you want to develop your career is entirely up to you
  • Your career is supported by interactive tools, numerous development programs and teammates who want to help you grow
  • We see value in helping each other be our best and that extends to empowering our employees in their career journeys