Senior Site Reliability Engineer

Varo Bank

💵 $150k-$190k
📍 Remote - United States

Summary

Join Varo's SRE team to design, build, and run large-scale distributed systems that power most of Varo's operations. As a Site Reliability Engineer, you'll take ownership of infrastructure availability and resiliency, write infrastructure as code, and improve observability and monitoring across Varo's infrastructure.

Requirements

  • 8+ years of experience as a Site Reliability, DevOps, or Software Engineer, with proficiency in one or more high-level languages (such as Python, Go, Ruby, Java, or JavaScript)
  • Excellent Linux and troubleshooting skills
  • Experience in building and supporting high-availability cloud environments in AWS
  • Experience using infrastructure as code (IaC) and deployment automation with tools such as Terraform, Helm, GitLab, or equivalent
  • Experience running Kubernetes in production
  • Willingness to participate in an on-call rotation for after-hours production infrastructure incidents
  • Experience with SDLC, CI/CD, and related tooling

Responsibilities

  • Take ownership of the availability and resiliency of Varo's cloud-based infrastructure
  • Design and maintain disaster recovery scenarios
  • Write and maintain infrastructure as code for core systems (Terraform, Terraform modules, and Kubernetes Helm charts)
  • Build and maintain CI/CD pipelines
  • Improve observability and monitoring across Varo's infrastructure by implementing advanced tools and technologies
  • Create and maintain monitoring dashboards, alerts, and log systems to quickly identify and resolve issues
  • Implement advanced observability tools like distributed tracing and anomaly detection for deeper system insights and efficient troubleshooting
  • Help lead high-profile incidents and facilitate blameless post-mortems
  • Collaborate with development teams to implement and improve SLIs and SLOs for their services and to promote service ownership
  • Use monitoring data to drive actionable insights and contribute to incident response strategies
  • Automate operational tasks to save time and improve accuracy
  • Write clean and scalable scripts, software, and systems to manage platform infrastructure and applications
This job is filled or no longer available