Summary
Join CoreWeave, a leading AI hyperscaler, as the Director of Production Engineering to lead and expand the SRE team. You will define and execute the SRE vision, strategy, and roadmap for a large-scale, distributed cloud infrastructure. This role involves leading and mentoring a high-performing team, championing automation-first practices, and establishing best practices in observability and monitoring. You will also drive incident management initiatives and collaborate with engineering, product, and customer support teams to build resilient systems. The ideal candidate is a thoughtful leader with technical depth and strategic vision who thrives in fast-paced environments. CoreWeave offers a competitive salary and a comprehensive benefits package.
Requirements
- Bachelor's degree in Computer Science, Engineering, or a related field
- 10+ years in engineering leadership roles within SRE, DevOps, or cloud infrastructure
- 5+ years managing large-scale infrastructure-as-a-service in a geographically distributed, always-on environment
- Proven success leading 24x7 operations teams and delivering high-availability services at scale
- Deep expertise in automation, monitoring/observability, and incident response frameworks
- Familiarity with cloud-native architectures purpose-built for AI, CI/CD systems, and performance tuning
Responsibilities
- Define and execute the SRE vision, strategy, and roadmap for a large-scale, distributed cloud infrastructure
- Lead and mentor a high-performing team of SREs, promoting a culture of ownership, collaboration, and continuous learning
- Champion automation-first practices, leveraging Infrastructure-as-Code and tools like Terraform and Kubernetes to minimize toil and manual intervention
- Establish and evolve best practices in observability, monitoring, and alerting, ensuring the platform is proactive, not reactive
- Drive initiatives for incident management, postmortem culture, root cause analysis, and system hardening
- Collaborate with engineering, product, and customer support teams to build scalable, resilient, and self-healing systems
- Evolve our on-call strategy and processes to support a 24x7, globally distributed platform with minimal disruptions
Preferred Qualifications
- Hands-on experience with Python, Go, Java, or Ruby for operational tooling and automation
- Strong track record of hiring, mentoring, and developing top-tier SRE talent in high-growth companies
- Comfortable navigating cross-functional dynamics and influencing leadership across engineering, product, and support
- Experience leading DevOps and reliability transformation projects, improving developer velocity and platform resilience
Benefits
- Medical, dental, and vision insurance - 100% paid for by CoreWeave
- Company-paid Life Insurance
- Voluntary supplemental life insurance
- Short- and long-term disability insurance
- Flexible Spending Account
- Health Savings Account
- Tuition Reimbursement
- Mental Wellness Benefits through Spring Health
- Family-Forming support provided by Carrot
- Paid Parental Leave
- Flexible, full-service childcare support with Kinside
- 401(k) with a generous employer match
- Flexible PTO
- Catered lunch each day in our office and data center locations
- A casual work environment
- A work culture focused on innovative disruption