Senior Software Engineer, Site Reliability

Gretel

📍 Remote - Worldwide

Job highlights

Summary

Join our team as a Senior or Staff Site Reliability Engineer at Gretel to ensure the safety, security, and reliability of our cloud infrastructure.

Requirements

  • Experience with at least one cloud platform (we use AWS heavily)
  • Experience with Docker and Kubernetes
  • Ability to write software and tools in Python or Go
  • Experience with monitoring, alerting, and operations
  • Experience operating highly available distributed systems in the cloud
  • Experience identifying, diagnosing, and responding to operational outages

Responsibilities

  • Build and maintain Gretel's observability stack
  • Measure and monitor Gretel's availability, latency, and overall system health
  • Scale systems sustainably through automation, and continuously improve and evolve them
  • Manage and lead incident response, recovery, and blameless postmortems
  • Partner with software engineers to troubleshoot production issues
  • Build tools and frameworks that help Gretel engineers be more productive
  • Ship complex ML/AI models in partnership with Gretel's applied science and engineering teams

Preferred Qualifications

  • Experience with infrastructure as code (Terraform, CloudFormation, etc.)
  • Experience with build systems such as Bazel
  • Experience shipping applications with complex dependencies (PyTorch, TensorFlow)
  • Software engineering skills beyond script writing (TDD, design patterns, etc.)
  • Experience with DevOps or CI/CD pipelines
