Senior Site Reliability Engineer

Roadie

πŸ“Remote - Worldwide

Summary

Join Roadie, a UPS company, as a Senior Site Reliability Engineer and contribute to the optimization and reliability of our platform. You will build systems, maintain Kubernetes clusters, deploy monitoring solutions, and collaborate with cross-functional teams. This role requires extensive experience in SRE, DevOps, Kubernetes, and AWS. We offer competitive compensation, comprehensive health insurance, 401k matching, tuition assistance, a flexible work schedule with unlimited PTO, and more. The ideal candidate is a skilled problem-solver with a strong understanding of site reliability practices and a willingness to learn.

Requirements

  • 5+ Years in various SRE roles
  • 5+ Years in various DevOps/System Engineering roles
  • 5+ Years of experience building and managing production Kubernetes infrastructure
  • 6+ Years of experience with popular scripting languages (Python, Ruby, Bash, etc.)
  • Experience with Infrastructure as code such as Terraform or Crossplane
  • Experience with CI/CD Development tools (CircleCI, etc.)
  • Experience with GitOps tools (ArgoCD)
  • Experience using a broad range of AWS technologies (RDS, Elasticsearch, VPC, EKS, S3, CloudFront, MSK, ElastiCache, CloudWatch, etc.)
  • Experience developing and maintaining YAML templating systems (Helm charts, Kustomize, etc.)
  • Must be able to work independently, be self-motivated and handle multiple priorities
  • Comfortable working in a fast-paced agile environment
  • Finally, a willingness to admit what you don’t know, and learn what you need to learn quickly

Responsibilities

  • Build systems that optimize the uptime and reliability of our platform, and support the management and optimization of our software delivery pipeline, observability and infrastructure operations
  • Maintain, support, and engineer production and non-production Kubernetes clusters (EKS) as well as Elasticsearch, MSK, RDS, and ElastiCache (Redis) clusters
  • Deploy and maintain monitoring and logging solutions based on Prometheus, Loki, Thanos, Grafana, OpenTelemetry and New Relic
  • Collaborate with cross-functional teams to identify and address potential bottlenecks, optimize resource utilization, and proactively prevent system failures
  • Define and manage SLOs, SLIs, and error budgets
  • Develop processes, tools and automation to reduce toil across engineering teams
  • Plan and forecast service capacity and demand, assess cost optimization, and tune systems and software
  • Debug production and non-production issues
  • Take part in a 24/7 on-call rotation

Benefits

  • Competitive compensation packages
  • 100% covered health insurance premiums for yourself
  • 401k with company match
  • Tuition and student loan repayment assistance (that’s right - Roadie will contribute directly to your existing student loans!)
  • Flexible work schedule with unlimited PTO
  • Monthly 3-day weekends
  • Monthly WFH stipend
  • Paid sabbatical leave: tenured team members are given time to rest, relax, and explore
  • The technology you need to get the job done
