Senior Site Reliability Engineer

Turvo

📍 Remote - United States

Summary

Join Turvo's SRE team as a hands-on Senior Site Reliability Engineer and make a significant impact. Collaborate with a global team to ensure customer satisfaction and exceed expectations. This remote role, based in Dallas, TX, requires experience in SRE, Production Support, and Elastic Kubernetes Service (EKS). You will manage application availability, performance, and capacity planning, establish standardized practices, and drive SLI/SLO measurement. The ideal candidate will have a strong technical background in cloud infrastructure, distributed systems, and Kubernetes. Turvo offers competitive salaries, bonuses, and a comprehensive benefits package.

Requirements

  • Bachelor’s degree in Computer Science, Computer Engineering, or a similar discipline
  • 10+ years of experience in SRE, DevOps, and/or Information Technology
  • Must have held previous role(s) in SRE/production support in a large-scale environment
  • Strong technical knowledge of cloud infrastructure, distributed systems, and reliability practices
  • Strong hands-on experience with Kubernetes (EKS) in production environments
  • Proficiency with AWS infrastructure and services (EC2, S3, RDS, IAM)
  • Hands-on experience with tools such as ELK (Elasticsearch, Logstash, Kibana), Grafana, CloudWatch, Jenkins, and Jira
  • Proficient in at least one scripting/programming language (Java, Python)
  • Significant experience with relational databases (MySQL) and NoSQL databases (preferably MongoDB)
  • Solid experience with Docker and Infrastructure-as-Code tools like Terraform or CloudFormation
  • Strong troubleshooting and problem-solving skills, with the ability to make swift, informed judgment calls
  • Strong written and verbal communication skills with demonstrated ability to communicate effectively with all levels of an organization
  • Must be eager to continuously improve customer experience by collaborating with engineering leads, product, and customer success teams
  • Passionate and collaborative team player with a strong work ethic and focus on achieving shared goals
  • Security background and understanding of SaaS platform security

Responsibilities

  • Manage the complete application availability, performance, efficiency, and capacity planning lifecycle while ensuring round-the-clock monitoring for a highly scalable and reliable platform
  • Establish standardized practices for monitoring, incident response, blameless postmortems, releases, and other maintenance activities
  • Create, prioritize, communicate, and execute a roadmap for the site reliability function to align with organizational goals
  • Drive and manage the measurement of SLI/SLO, ensuring the team meets established goals for availability and SLA
  • Manage and resolve cross-team performance issues, from identifying the root cause to determining and implementing improvements
  • Collaborate with engineering leads to influence and prioritize resiliency and reliability efforts through code, monitoring feedback, and process enhancements

Benefits

  • Great health, dental, and vision benefits
  • Competitive salaries and bonuses
  • 401k with employer match
  • Learning & development opportunities
  • Paid parental leave
  • Focus on work-life balance
  • Monthly wellness day
