Service Reliability Engineer
Thoughtworks
Job highlights
Summary
Join Thoughtworks as a Service Reliability Engineer (SRE) and champion Site Reliability Engineering principles. You will focus on reliability, resilience, and system performance, integrating automation, monitoring, and incident response. Responsibilities include understanding SRE goals, improving reliability, enhancing incident management, managing stakeholder expectations, collaborating with engineering teams, identifying performance enhancement opportunities, mentoring other SREs, and collaborating with application development leads and solution architects. You will leverage your expertise in various technologies and possess strong communication and problem-solving skills. Thoughtworks offers a supportive culture with opportunities for professional development.
Requirements
- Program with one or more high-level languages such as Python, Golang, Shell scripting, Ruby or Java
- Be familiar with DevOps and GitOps practices, driving the integration of observability automation into CI/CD pipelines, e.g., GitLab, Jenkins, CircleCI or equivalent
- Have in-depth knowledge of configuration management and Infrastructure as Code (IaC) tools such as Terraform, Ansible, ARM and CloudFormation for provisioning and managing infrastructure
- Have expertise in observability, logging, tracing and monitoring tools such as Grafana (Loki and Tempo), Prometheus, Graylog, Jaeger, Zipkin, ELK stack or equivalent
- Have a strong understanding of container-based architecture and hands-on experience with orchestration tools such as Kubernetes, AWS EKS, Docker Swarm, Nomad, etc.
- Have in-depth experience in application and infrastructure performance tuning and scaling to handle heavy loads under different scenarios, e.g., periodic traffic load and tsunami patterns
- Have a good understanding of essential concepts such as quality gates encompassing SLI/SLO/SLA, chaos engineering, golden signals, blameless postmortem methodologies, synthetic monitoring, distributed tracing, end-user monitoring and performance testing
- Have experience with network load balancing, security tech stacks, Transport Layer Security (TLS) and certificate management, and an understanding of standard networking protocols and configurations
- Have strong communication and articulation skills, and be proficient in English
- Be able to convey resolutions to audiences with varying degrees of technical/business proficiency and bring them to consensus
- Have excellent problem-solving and analytical skills, with a focus on continuous improvement
- Have good listening and presentation skills
- Solve challenging problems and difficult-to-debug issues with a never-give-up attitude
- Collaborate with cross-functional engineering teams to conduct capacity planning and scalability assessments, and design solutions for handling current and future growth
- Have the ability to work under pressure, with composure, during production incidents
- Understand the technical and business requirements provided by the client, and be able to break them down for successful implementation
- Be willing to join a rotation- and need-based team that is available 24x7
Responsibilities
- Understand requirements or SRE goals in depth from both tech and business perspectives
- Provide solutions to improve reliability, including identifying and implementing mechanisms and architectures that enable fault tolerance and shorten mean time to detect (MTTD) and mean time to respond (MTTR)
- Enhance the incident management process, including the development of an incident prioritization matrix, triage, communication, mitigation, post-mortem analysis and implementation of corrective actions
- Manage client stakeholder expectations and queries during production incidents, providing detailed technical analysis of issues and remediation plans for mitigation and future prevention, and act as the interface for C-level executives if and when needed
- Be a liaison with client engineering teams, build trust and productive relationships with senior client stakeholders and team leads to influence them in making better decisions
- Identify opportunities for enhancing system performance and reliability in alignment with business SLAs, SLOs, KPIs and objectives, and provide guidance and assistance to SRE teams in implementing the identified improvements
- Collaborate with Thoughtworks application development leads and solution architects, recommending changes in system design and adopting best practices for improved reliability from day one
- Oversee and mentor other SREs on the team, contributing to their growth and development
Benefits
Learning & Development opportunities