Director of Reliability Engineering

Astronomer

💵 $260k-$290k
📍 Remote - United States

Summary

Join Astronomer as the Director of Reliability Engineering and lead global reliability initiatives that support critical services for companies worldwide. In this strategic leadership role, you will define, drive, and evolve operational excellence, platform reliability, and automation at scale across our cloud-native infrastructure. You will lead and mentor high-performing SRE teams, collaborate with engineers and product managers, own service availability and performance, and build automation that prevents issues from recurring, ensuring a seamless and resilient customer experience. You will also champion observability and self-healing systems, evolve incident management processes, drive SLO adoption, and manage global on-call rotations while strengthening on-call culture. Finally, you will partner with engineering, product, security, and program management teams to improve reliability and cultivate a culture of continuous improvement.

Requirements

  • 10+ years of experience in software engineering, SRE, or DevOps roles
  • 5+ years in a technical leadership capacity, ideally in a high-growth, cloud-native SaaS environment
  • Proven success operating and scaling large-scale, distributed, mission-critical systems
  • Deep expertise in public cloud platforms (AWS, Azure, or GCP)
  • Hands-on knowledge of infrastructure as code (Terraform, CloudFormation), container orchestration (Kubernetes), and observability tools (e.g., Prometheus, Grafana, Datadog, Splunk)
  • Experience implementing and managing CI/CD pipelines and secure development practices
  • Demonstrated ability to hire, grow, and lead globally distributed SRE teams
  • Strong decision-making, communication, and cross-functional collaboration skills

Responsibilities

  • Define and lead the strategic direction for SRE, reliability, and operational excellence across the organization
  • Collaborate with software engineers and product managers on projects that impact users, and take direct responsibility for service uptime
  • Own end-to-end availability and performance of key services; build automation to prevent recurrence of issues and automate responses to all non-exceptional service conditions
  • Design, write, and deliver software to improve the availability, scalability, latency, and efficiency of services
  • Champion observability, automation, and self-healing systems to proactively prevent downtime and reduce manual toil
  • Evolve and manage our incident and change management processes, including root cause analysis and postmortems
  • Drive adoption of SLOs, SLIs, and error budgets to align engineering efforts with business priorities
  • Work with operational support to manage global on-call rotations using a follow-the-sun model to ensure around-the-clock coverage
  • Support on-call culture by defining best practices for incident response, escalation policies, and operational readiness
  • Partner closely with engineering, product, security, and program management teams to improve reliability without slowing innovation
  • Cultivate a culture of continuous improvement, high accountability, and blameless incident management
  • Lead and mentor the team, establishing credibility through high-quality technical execution
  • Provide strong mentorship and leadership to grow the next generation of reliability and engineering leaders

Preferred Qualifications

  • Bachelor’s or Master’s degree in Computer Science, Information Systems, or a related field
  • Experience managing vendor relationships and partnerships
  • Comfortable presenting to executive stakeholders in high-stakes environments
  • Proven ability to scale operations during rapid business or organizational growth
  • Strong analytical mindset with the ability to evaluate trade-offs between reliability, speed, and innovation

Benefits

  • The estimated salary for this role ranges from $260,000 to $290,000, along with an equity component
  • Astronomer is a remote-first company
