Summary
Join Netlify's growing SRE team as a Staff Site Reliability Engineer, where you'll lead high-impact reliability and infrastructure initiatives across the platform. You'll drive the adoption of Infrastructure-as-Code, manage cloud infrastructure components, define architectural standards, and mentor senior engineers and tech leads. You'll partner with Engineering, Product, and Executive teams to embed reliability into company-wide strategy, lead architecture reviews, and develop reliability metrics and SLO frameworks. You'll also participate in an on-call rotation and occasionally act as Incident Commander, providing technical leadership and system-level decision-making.
Requirements
- Deep expertise in cloud architecture, with hands-on experience designing and deploying global-scale solutions on AWS, Azure, or GCP
- Strong proficiency with Kafka or similar messaging systems, including deployment, scaling, and maintenance in multi-cloud environments
- Solid experience in database design, performance tuning, and maintenance for both relational and NoSQL systems in high-throughput environments
- Skilled in programming and scripting languages such as Go or Python, with a focus on automation and infrastructure tooling
- A proven track record of leading large-scale, cross-team technical initiatives and delivering impactful infrastructure outcomes
- Proficiency in configuration management tools like Ansible, Chef, or Puppet
- Experience in managing CI/CD pipelines using tools such as Jenkins, GitLab CI, CircleCI, or similar
- Excellent communication skills, with the ability to articulate complex technical strategies to executives and build consensus across diverse teams
- Demonstrated success in setting and scaling technical standards and best practices across large engineering organizations
Responsibilities
- Lead high-impact reliability and infrastructure initiatives across the platform
- Drive the adoption of Infrastructure-as-Code and champion reliability-focused tooling and frameworks
- Manage all cloud infrastructure components, including instances, networking, DNS, Terraform automation, and Kubernetes
- Define and uphold architectural standards, best practices, and technical strategy for reliability at scale
- Provide mentorship to senior engineers and tech leads, fostering systems thinking and operational excellence
- Partner with Engineering, Product, and Executive teams to embed reliability into company-wide strategy
- Lead architecture reviews and provide oversight for critical infrastructure projects
- Develop and advocate for reliability metrics and SLO frameworks that align with business goals
- Participate in an on-call rotation and occasionally act as Incident Commander, providing technical leadership and system-level decision-making
Preferred Qualifications
- You think in systems. You’re curious about how infrastructure, networking, observability, and security connect—and enjoy breaking down complex challenges into clear, actionable strategies
- You’re comfortable writing code (especially in Go) and enjoy automating infrastructure workflows, building tools to reduce manual effort, and supporting reliable operations at scale
- You’ve collaborated on cross-functional initiatives—like operational readiness reviews, cloud migrations, or introducing monitoring standards—and know how to communicate clearly with both technical and non-technical teammates
- You take a thoughtful, methodical approach to troubleshooting. You seek context before jumping to solutions, validate assumptions, and can clearly explain how you navigate production issues or potential incidents
- You work well in a distributed environment and value clear, respectful communication. Whether async or live, you prioritize inclusivity, documentation, and creating space for others to contribute
- You’re energized by helping others grow—whether that’s through mentoring, sharing knowledge, or building systems that support better outcomes across the team
- You approach reliability as a proactive practice, not just a reactive one. You care about preventing issues before they become incidents and building systems that help everyone sleep better at night
- You’re drawn to big, interesting challenges. The idea of helping shape a global CDN, support edge computing innovation, and rethink infrastructure for modern developers is what motivates you
Benefits
- We are a remote-first, globally distributed group that values asynchronous communication, documentation, and a culture of transparency, empowerment, and collective ownership
- Diversity and inclusion are at the heart of what we do, and we welcome team members from all backgrounds to bring their unique perspectives to our mission
- Whether you’re launching a new phase of your career or growing an established one, Netlify offers a supportive environment where you can thrive while maintaining a healthy work-life balance
- We welcome candidates based in Spain, Canada, or the UK for this position
- Our base compensation for this role is targeted at €84,000 - €113,000 for most Spain-based locations, CAD $163,000 - CAD $221,000 for most Canada-based locations, or £94,000 - £127,000 for most UK-based locations
- Candidates outside these locations, or in premium markets, should consult their Talent Acquisition partner about location-based ranges, which may be higher or lower than the ranges listed
- The starting pay will be determined based on multiple factors, including expertise and skills, market demands, experience, internal equity, and applicable geographic location
- These compensation packages and ranges are subject to change and may be modified in the future