Staff Infrastructure Site Reliability Engineer

Netlify

💵 $92k-$165k
📍Remote - Worldwide

Summary

Join Netlify's growing SRE team as a Staff Site Reliability Engineer, where you'll lead high-impact reliability and infrastructure initiatives across the platform. You'll drive the adoption of Infrastructure-as-Code, manage cloud infrastructure components, define architectural standards, and provide mentorship to senior engineers. You'll collaborate with various teams to embed reliability into company-wide strategy, lead architecture reviews, and develop reliability metrics. You'll also participate in an on-call rotation and occasionally act as Incident Commander, providing technical leadership and system-level decision-making.

Requirements

  • Deep expertise in cloud architecture, with hands-on experience designing and deploying global-scale solutions on AWS, Azure, or GCP
  • Strong proficiency with Kafka or similar messaging systems, including deployment, scaling, and maintenance in multi-cloud environments
  • Solid experience in database design, performance tuning, and maintenance for both relational and NoSQL systems in high-throughput environments
  • Skilled in programming and scripting languages such as Go or Python, with a focus on automation and infrastructure tooling
  • A proven track record of leading large-scale, cross-team technical initiatives and delivering impactful infrastructure outcomes
  • Proficiency in configuration management tools like Ansible, Chef, or Puppet
  • Experience in managing CI/CD pipelines using tools such as Jenkins, GitLab CI, CircleCI, or similar
  • Excellent communication skills, with the ability to articulate complex technical strategies to executives and build consensus across diverse teams
  • Demonstrated success in setting and scaling technical standards and best practices across large engineering organizations

Responsibilities

  • Lead high-impact reliability and infrastructure initiatives across the platform
  • Drive the adoption of Infrastructure-as-Code and champion reliability-focused tooling and frameworks
  • Manage all cloud infrastructure components, including instances, networking, DNS, Terraform automation, and Kubernetes
  • Define and uphold architectural standards, best practices, and technical strategy for reliability at scale
  • Provide mentorship to senior engineers and tech leads, fostering systems thinking and operational excellence
  • Partner with Engineering, Product, and Executive teams to embed reliability into company-wide strategy
  • Lead architecture reviews and provide oversight for critical infrastructure projects
  • Develop and advocate for reliability metrics and SLO frameworks that align with business goals
  • Participate in an on-call rotation and occasionally act as Incident Commander, providing technical leadership and system-level decision-making

Preferred Qualifications

  • You think in systems. You’re curious about how infrastructure, networking, observability, and security connect—and enjoy breaking down complex challenges into clear, actionable strategies
  • You’re comfortable writing code (especially in Go) and enjoy automating infrastructure workflows, building tools to reduce manual effort, and supporting reliable operations at scale
  • You’ve collaborated on cross-functional initiatives—like operational readiness reviews, cloud migrations, or introducing monitoring standards—and know how to communicate clearly with both technical and non-technical teammates
  • You take a thoughtful, methodical approach to troubleshooting. You seek context before jumping to solutions, validate assumptions, and can clearly explain how you navigate production issues or potential incidents
  • You work well in a distributed environment and value clear, respectful communication. Whether async or live, you prioritize inclusivity, documentation, and creating space for others to contribute
  • You’re energized by helping others grow—whether that’s through mentoring, sharing knowledge, or building systems that support better outcomes across the team
  • You approach reliability as a proactive practice, not just a reactive one. You care about preventing issues before they become incidents and building systems that help everyone sleep better at night
  • You’re drawn to big, interesting challenges. The idea of helping shape a global CDN, support edge computing innovation, and rethink infrastructure for modern developers is what motivates you

Benefits

  • We are a remote-first, globally distributed group that values asynchronous communication, documentation, and a culture of transparency, empowerment, and collective ownership
  • Diversity and inclusion are at the heart of what we do, and we welcome team members from all backgrounds to bring their unique perspectives to our mission
  • Whether you’re launching a new phase of your career or growing an established one, Netlify offers a supportive environment where you can thrive while maintaining a healthy work-life balance
  • We welcome candidates based in Spain, Canada, or the UK for this position
  • Our base compensation for this role is targeted at €84,000 - €113,000 for most Spain-based locations, CAD $163,000 - CAD $221,000 for most Canada-based locations, or £94,000 - £127,000 for most UK-based locations
  • Candidates outside these locations, or in premium markets, should consult with their Talent Acquisition partner regarding location-based ranges, as they may be higher or lower than the average ranges listed
  • The starting pay will be determined based on multiple factors, including expertise and skills, market demands, experience, internal equity, and applicable geographic location
  • These compensation packages and ranges are subject to change and may be modified in the future
This job is filled or no longer available
