Summary
Join the Wikimedia Foundation as a Senior Site Reliability Engineer, Databases, and be part of a small team ensuring the health of our database systems. As a key member, you will support development and deployment, troubleshoot issues, automate tasks, plan for disaster recovery, and maintain backups.
Requirements
- Proficiency in automation, programming, and scripting
 - Experience with Open Source configuration management and orchestration tools (Puppet, Ansible, Chef, SaltStack, etc.), as well as modern observability infrastructure (Prometheus, Grafana, Logstash/Kibana, Icinga/Nagios, etc.)
 - Advanced knowledge of Linux and IO/data storage concepts, internals and troubleshooting
 - Experience remotely managing both bare-metal servers and virtualized environments
 - 5+ years of experience in an SRE/Operations/DevOps role as part of a team
 - Experience with high-traffic, highly available website architectures and operations
 - Strong English language skills
 - Ability to work independently in a fast-paced environment and as an effective part of a globally distributed team, using ticket tracking systems and asynchronous communication tools
 - B.Sc. or M.Sc. in Computer Science or equivalent work experience
 
Responsibilities
- Operation, maintenance, troubleshooting and automation of relational database systems in production and staging environments
 - Handling configuration management, (Debian) package maintenance, patching and building, working with upstream on bug identification and resolution
 - Improving observability (alerting, metrics, monitoring) of database infrastructure
 - Multi-datacenter systems design, capacity and infrastructure planning
 - Taking part in incident response, diagnosis and follow-up on system outages or alerts across Wikimedia's production infrastructure, and participating in an on-call rotation
 - Sharing our values and working in accordance with them
 
Benefits
- Health insurance
 - Retirement benefits
 - Paid time off
 - Remote work, flexible hours