Senior Site Reliability Engineer

Wikimedia Foundation

πŸ“Remote - Worldwide

Summary

Join the Wikimedia Foundation as a Senior Site Reliability Engineer (SRE) and contribute to the infrastructure that powers Wikipedia and other Wikimedia projects. As a member of the Service Operations SRE team, you will design, implement, and maintain the infrastructure and services supporting Wikimedia's projects, including Kubernetes clusters and application servers. You will participate in 24/7 incident response, collaborate with a global team, and mentor peers. The role involves automating tasks, optimizing performance, and proactively identifying sources of instability. You will also lead incident response and post-incident reviews. This position requires frequent work with other SRE team members and interaction with teams outside of SRE.

Requirements

  • 5+ years of experience in an SRE/Operations/DevOps role
  • Experience with operating highly available infrastructure
  • Experience with running applications and services at scale
  • Experience implementing containerization solutions (Docker, Kubernetes)
  • Proficient with shell and a programming language used in an SRE/Operations engineering context (Python, Go, Ruby, etc.)
  • Comfortable with open source configuration management and orchestration tools (Puppet, Ansible, Terraform, etc.)
  • Communicative technical English

Responsibilities

  • Design, implementation, and maintenance of public-facing infrastructure and services
  • Use of configuration management and deployment tools
  • Architectural design and operation at scale
  • Monitoring of systems and services, optimization of performance and resource utilization
  • Proactively identify sources of instability in distributed systems and analyze how complex systems fail from a reliability and resilience perspective
  • Common operating system level tasks such as logging and backup / restore
  • Cookbook / runbook implementation for common maintenance actions
  • Participate in 24/7 on-call rotation and escalations for resolving production issues
  • Lead incident response and post-incident reviews, contributing to failure analysis and implementing preventive measures
  • Automation and streamlining of tasks as well as identifying process gaps
  • Collaborating with a global and asynchronously communicating team (don’t worry if you have never worked remotely, we’ll help you get used to it)
  • Mentoring peers in your areas of technical and operational strength
  • Travel domestically, and potentially internationally, 2-3 times a year for team gatherings and conferences

Preferred Qualifications

  • Experience with package management for operating systems (Debian, etc.)
  • We are avid supporters (and users) of open source software; a history of contributing to open source projects is valued
  • Familiarity with RFC 2549
  • Prior participation in the Wikimedia movement

Benefits

  • The anticipated annual pay range for applicants based in the United States is US$109,047 to US$169,455. The offered pay is determined by multiple individualized factors, including cost of living in the applicant's location
  • For applicants located outside of the US, the pay range will be adjusted to the country of hire
  • We neither ask for nor take into consideration the salary history of applicants
