Senior Site Reliability Engineer

Wikimedia Foundation

πŸ“Remote - Worldwide

Summary

Join the Wikimedia Foundation as a Senior Site Reliability Engineer (SRE) and contribute to the infrastructure that powers Wikipedia and other Wikimedia projects. As a member of the Service Operations SRE team, you will design, implement, and maintain the infrastructure and services supporting Wikimedia's projects, including Kubernetes clusters and application servers. You will participate in 24/7 incident response, collaborate with a global team, and mentor peers. This role requires experience in operating highly available infrastructure at scale, proficiency in shell scripting and programming languages, and familiarity with configuration management tools. The position involves on-call rotation and occasional domestic or international travel. The Wikimedia Foundation is a remote-first organization offering competitive salaries and benefits.

Requirements

  • 5+ years of experience in an SRE/Operations/DevOps role
  • Experience with operating highly available infrastructure
  • Experience with running applications and services at scale
  • Experience implementing containerization solutions (Docker, Kubernetes)
  • Proficient with shell and a programming language used in an SRE/Operations engineering context (Python, Go, Ruby, etc.)
  • Comfortable with Open Source configuration management and orchestration tools (Puppet, Ansible, Terraform, etc.)
  • Communicative technical English

Responsibilities

  • Design, implementation and maintenance of public facing infrastructure and services
  • Use of configuration management and deployment tools
  • Architectural design and operation at scale
  • Monitoring of systems and services, optimization of performance and resource utilization
  • Proactively identify sources of instability in distributed systems and analyze how complex systems fail from a reliability and resilience perspective
  • Common operating system level tasks such as logging and backup / restore
  • Cookbook / runbook implementation for common maintenance actions
  • Participate in 24/7 on-call rotation and escalations for resolving production issues
  • Lead incident response and post-incident reviews, contributing to failure analysis and implementing preventive measures
  • Automation and streamlining of tasks as well as identifying process gaps
  • Collaborating with a global and asynchronously communicating team (don’t worry if you have never worked remotely, we’ll help you get used to it)
  • Mentoring peers in your areas of technical and operational strength
  • Expected to travel domestically or potentially internationally 2-3 times a year for team gatherings and conferences

Preferred Qualifications

  • Experience with package management for operating systems (Debian, etc.)
  • We are avid supporters (and users) of open source software; history of contributing to Open Source projects is valued
  • Familiarity with RFC 2549
  • Prior participation in the Wikimedia movement

Benefits

  • Salaries at the Wikimedia Foundation are set in a way that is competitive, equitable, and consistent with our values and culture
  • The anticipated annual pay range of this position for applicants based within the United States is US$ 109,047 to US$ 169,455 with multiple individualized factors, including cost of living in the location, being the determinants of the offered pay
  • For applicants located outside of the US, the pay range will be adjusted to the country of hire
  • We neither ask for nor take into consideration the salary history of applicants
  • The compensation for a successful applicant will be based on their skills, experience and location
