Staff Infrastructure Site Reliability Engineer at Netlify

Summary

Join Netlify's growing SRE team as a Staff Site Reliability Engineer, where you'll lead high-impact reliability and infrastructure initiatives across the platform. You'll drive the adoption of Infrastructure-as-Code, manage cloud infrastructure components, define architectural standards, and mentor senior engineers. You'll partner with Engineering, Product, and Executive teams to embed reliability into company-wide strategy, lead architecture reviews, and develop reliability metrics. You'll also participate in an on-call rotation and occasionally act as Incident Commander, providing technical leadership and system-level decision-making.

Requirements

Deep expertise in cloud architecture, with hands-on experience designing and deploying global-scale solutions on AWS, Azure, or GCP
Strong proficiency with Kafka or similar messaging systems, including deployment, scaling, and maintenance in multi-cloud environments
Solid experience in database design, performance tuning, and maintenance for both relational and NoSQL systems in high-throughput environments
Skilled in programming and scripting languages such as Go or Python, with a focus on automation and infrastructure tooling
A proven track record of leading large-scale, cross-team technical initiatives and delivering impactful infrastructure outcomes
Proficiency in configuration management tools like Ansible, Chef, or Puppet
Experience managing CI/CD pipelines using tools such as Jenkins, GitLab CI, CircleCI, or similar
Excellent communication skills, with the ability to articulate complex technical strategies to executives and build consensus across diverse teams
Demonstrated success in setting and scaling technical standards and best practices across large engineering organizations

Responsibilities

Lead high-impact reliability and infrastructure initiatives across the platform
Drive the adoption of Infrastructure-as-Code and champion reliability-focused tooling and frameworks
Manage all cloud infrastructure components, including instances, networking, DNS, Terraform automation, and Kubernetes
Define and uphold architectural standards, best practices, and technical strategy for reliability at scale
Provide mentorship to senior engineers and tech leads, fostering systems thinking and operational excellence
Partner with Engineering, Product, and Executive teams to embed reliability into company-wide strategy
Lead architecture reviews and provide oversight for critical infrastructure projects
Develop and advocate for reliability metrics and SLO frameworks that align with business goals
Participate in an on-call rotation and occasionally act as Incident Commander, providing technical leadership and system-level decision-making

Preferred Qualifications

You think in systems. You’re curious about how infrastructure, networking, observability, and security connect—and enjoy breaking down complex challenges into clear, actionable strategies
You’re comfortable writing code (especially in Go) and enjoy automating infrastructure workflows, building tools to reduce manual effort, and supporting reliable operations at scale
You’ve collaborated on cross-functional initiatives—like operational readiness reviews, cloud migrations, or introducing monitoring standards—and know how to communicate clearly with both technical and non-technical teammates
You take a thoughtful, methodical approach to troubleshooting. You seek context before jumping to solutions, validate assumptions, and can clearly explain how you navigate production issues or potential incidents
You work well in a distributed environment and value clear, respectful communication. Whether async or live, you prioritize inclusivity, documentation, and creating space for others to contribute
You’re energized by helping others grow—whether that’s through mentoring, sharing knowledge, or building systems that support better outcomes across the team
You approach reliability as a proactive practice, not just a reactive one. You care about preventing issues before they become incidents and building systems that help everyone sleep better at night
You’re drawn to big, interesting challenges. The idea of helping shape a global CDN, support edge computing innovation, and rethink infrastructure for modern developers is what motivates you

Benefits

We are a remote-first, globally distributed group that values asynchronous communication, documentation, and a culture of transparency, empowerment, and collective ownership
Diversity and inclusion are at the heart of what we do, and we welcome team members from all backgrounds to bring their unique perspectives to our mission
Whether you’re launching a new phase of your career or growing an established one, Netlify offers a supportive environment where you can thrive while maintaining a healthy work-life balance
We welcome candidates based in Spain, Canada, or the UK for this position
Our base compensation for this role is targeted at €84,000 - €113,000 for most Spain-based locations, CAD $163,000 - CAD $221,000 for most Canada-based locations, or £94,000 - £127,000 for most UK-based locations
Candidates outside these locations, or in premium markets, should consult with their Talent Acquisition partner regarding location-based ranges, as they may be higher or lower than the average ranges listed
The starting pay will be determined based on multiple factors, including expertise and skills, market demands, experience, internal equity, and applicable geographic location
These compensation packages and ranges are subject to change and may be modified in the future
