Site Reliability Engineer

Roadie

📍 Remote - Worldwide

Summary

Join Roadie, a UPS company, as a Site Reliability Engineer and contribute to the reliability, scalability, and performance of our platform. You will support Kubernetes clusters, observability tools, and other systems. Responsibilities include identifying and remediating system bottlenecks, monitoring service level indicators, automating operations, and troubleshooting production issues. You will partner with senior engineers and participate in on-call rotations. Roadie offers competitive compensation, comprehensive health insurance, 401k matching, tuition assistance, flexible work schedules, unlimited PTO, and more.

Requirements

  • 3+ years in SRE roles
  • 3+ years in DevOps/systems engineering roles
  • 3+ years of experience building and managing production Kubernetes infrastructure
  • 3+ years of experience with popular scripting languages (Python, Ruby, Bash, etc.)
  • Experience with infrastructure-as-code tools such as Terraform or Crossplane
  • Experience with CI/CD tools (CircleCI, etc.)
  • Experience with GitOps tools (ArgoCD)
  • Experience using a broad range of AWS technologies (RDS, Elasticsearch, VPC, EKS, S3, CloudFront, MSK, ElastiCache, CloudWatch, etc.)
  • Experience developing and maintaining YAML templating systems (Helm charts, Kustomize, etc.)
  • Must be able to work independently, be self-motivated and handle multiple priorities
  • Comfortable working in a fast-paced agile environment
  • Finally, a willingness to admit what you don’t know, and learn what you need to learn quickly

Responsibilities

  • Support the reliability, scalability, and performance of our platform through hands-on work with our infrastructure and deployment pipelines
  • Assist in maintaining and operating Kubernetes clusters (EKS), as well as other systems including Elasticsearch, MSK, RDS, and Redis
  • Contribute to the deployment, tuning, and upkeep of observability tools like Prometheus, Loki, Grafana, OpenTelemetry, and New Relic
  • Partner with more senior engineers to identify and remediate system bottlenecks and improve resource utilization
  • Participate in the monitoring and tracking of service level indicators (SLIs) and service level objectives (SLOs)
  • Write scripts and build automation to streamline operations and reduce manual work
  • Help troubleshoot production and non-production issues as part of the incident response process
  • Participate in an on-call rotation

Benefits

  • Competitive compensation packages
  • 100% covered health insurance premiums for yourself
  • 401k with company match
  • Tuition and student loan repayment assistance (that’s right: Roadie will contribute directly to your existing student loans!)
  • Flexible work schedule with unlimited PTO
  • Monthly 3-day weekends
  • Monthly WFH stipend
  • Paid sabbatical leave: tenured team members are given time to rest, relax, and explore
