Senior Site Reliability Engineer at Wikimedia Foundation

Summary

Join the Wikimedia Foundation as a Senior Site Reliability Engineer (SRE) and contribute to the infrastructure that powers Wikipedia and other Wikimedia projects. As a member of the Service Operations SRE team, you will design, implement, and maintain the infrastructure and services supporting Wikimedia's projects, including Kubernetes clusters and application servers. You will participate in 24/7 incident response, collaborate with a global team, and mentor peers. This role requires experience in operating highly available infrastructure at scale, proficiency in shell scripting and programming languages, and familiarity with configuration management tools. The position involves on-call rotation and occasional domestic or international travel. The Wikimedia Foundation is a remote-first organization offering competitive salaries and benefits.

Requirements

5+ years of experience in an SRE/Operations/DevOps role
Experience with operating highly available infrastructure
Experience with running applications and services at scale
Experience implementing containerization solutions (Docker, Kubernetes)
Proficient with shell and a programming language used in an SRE/Operations engineering context (Python, Go, Ruby, etc.)
Comfortable with Open Source configuration management and orchestration tools (Puppet, Ansible, Terraform, etc.)
Communicative technical English

Responsibilities

Design, implementation, and maintenance of public-facing infrastructure and services
Use of configuration management and deployment tools
Architectural design and operation at scale
Monitoring of systems and services, optimization of performance and resource utilization
Proactively identify sources of instability in distributed systems and analyze how complex systems fail from a reliability and resilience perspective
Common operating system level tasks such as logging and backup / restore
Cookbook / runbook implementation for common maintenance actions
Participate in 24/7 on-call rotation and escalations for resolving production issues
Lead incident response and post-incident reviews, contributing to failure analysis and implementing preventive measures
Automation and streamlining of tasks as well as identifying process gaps
Collaborating with a global and asynchronously communicating team (don’t worry if you have never worked remotely, we’ll help you get used to it)
Mentoring peers in your areas of technical and operational strength
Travel domestically or potentially internationally 2-3 times a year for team gatherings and conferences

Preferred Qualifications

Experience with package management for operating systems (Debian, etc.)
We are avid supporters (and users) of open source software; history of contributing to Open Source projects is valued
Familiarity with RFC 2549
Prior participation in the Wikimedia movement

Benefits

Salaries at the Wikimedia Foundation are set in a way that is competitive, equitable, and consistent with our values and culture
The anticipated annual pay range for this position for applicants based within the United States is US$ 109,047 to US$ 169,455; the offered pay is determined by multiple individualized factors, including cost of living in the location
For applicants located outside of the US, the pay range will be adjusted to the country of hire
We neither ask for nor take into consideration the salary history of applicants
The compensation for a successful applicant will be based on their skills, experience and location
