Senior Site Reliability Engineer at Wikimedia Foundation

Summary

Join the Wikimedia Foundation as a Senior Site Reliability Engineer (SRE) and contribute to the infrastructure that powers Wikipedia and other Wikimedia projects. As a member of the Service Operations SRE team, you will design, implement, and maintain the infrastructure and services supporting Wikimedia's projects, including Kubernetes clusters and application servers. You will participate in 24/7 incident response, collaborate with a global team, and mentor peers. The role involves automating tasks, optimizing performance, and proactively identifying sources of instability. You will also lead incident response and post-incident reviews. This position requires frequent work with other SRE team members and interaction with teams outside of SRE.

Requirements

5+ years of experience in an SRE/Operations/DevOps role
Experience with operating highly available infrastructure
Experience with running applications and services at scale
Experience implementing containerization solutions (Docker, Kubernetes)
Proficient with shell and a programming language used in an SRE/Operations engineering context (Python, Go, Ruby, etc.)
Comfortable with open source configuration management and orchestration tools (Puppet, Ansible, Terraform, etc.)
Communicative technical English

Responsibilities

Design, implementation, and maintenance of public-facing infrastructure and services
Use of configuration management and deployment tools
Architectural design and operation at scale
Monitoring of systems and services, optimization of performance and resource utilization
Proactively identify sources of instability in distributed systems and analyze how complex systems fail from a reliability and resilience perspective
Common operating system level tasks such as logging and backup / restore
Cookbook / runbook implementation for common maintenance actions
Participate in 24/7 on-call rotation and escalations for resolving production issues
Lead incident response and post-incident reviews, contributing to failure analysis and implementing preventive measures
Automation and streamlining of tasks as well as identifying process gaps
Collaborating with a global and asynchronously communicating team (don’t worry if you have never worked remotely; we’ll help you get used to it)
Mentoring peers in your areas of technical and operational strength
Expected to travel domestically or potentially internationally 2-3 times a year for team gatherings and conferences

Preferred Qualifications

Experience with package management for operating systems (Debian, etc.)
We are avid supporters (and users) of open source software; a history of contributing to open source projects is valued
Familiarity with RFC 2549
Prior participation in the Wikimedia movement

Benefits

The anticipated annual pay range of this position for applicants based within the United States is US$109,047 to US$169,455; the offered pay is determined by multiple individualized factors, including cost of living in the location
For applicants located outside of the US, the pay range will be adjusted to the country of hire
We neither ask for nor take into consideration the salary history of applicants

Remote

DevOps

Senior
