Site Reliability Engineer - Storage Engineer

GoDaddy
Summary
Join GoDaddy's dynamic team as a Site Reliability Engineer (SRE) specializing in automating and maintaining storage infrastructure using Ceph. In this remote position, you will ensure the reliability, scalability, and performance of our systems. You will automate day-to-day storage system operations, develop and maintain automation tools, monitor system performance, and implement solutions for high availability. The role also involves working within agile practices and continuously improving system reliability. GoDaddy offers a range of benefits, including paid time off, retirement savings options, bonuses, health benefits, and parental leave. The company embraces diversity and inclusion, fostering a supportive and collaborative work environment.
Requirements
- 2+ years of experience in site reliability engineering or a similar role
- Proficiency in working with Ceph, including deployment, configuration, and management of Ceph clusters and systems
- 1+ years of professional experience with Ceph
- Experience working on Linux/Unix systems, with a focus on automation and operating at scale
- Proficiency in Python or Bash
- Experience with Ansible, Terraform, or SaltStack
- Experience with Nagios-based monitoring tools, such as Icinga2
- Experience with observability tooling, such as Prometheus, Grafana, Mimir, and Loki
- Solid understanding of core networking concepts and protocols, particularly in relation to Linux/Unix systems
- Experience with Agile concepts and methodologies, including participation in Scrum or Kanban teams, and familiarity with Agile tools and practices
- Demonstrates solid analytical and troubleshooting skills, with the ability to resolve moderately complex issues in distributed systems with guidance when needed
- Communicates clearly and works well within a team environment, contributing to collaboration and knowledge sharing with guidance when needed
Responsibilities
- Automate and maintain day-to-day operations of storage systems to support application demands
- Develop and maintain tools and automation scripts to streamline storage operations and improve efficiency
- Monitor system performance, identify issues, and implement solutions to ensure high availability and reliability
- Participate in agile practices such as daily stand-up meetings, task tracking boards, design and code reviews, automated testing, and continuous integration and deployment
- Continuously improve system reliability, performance, and capacity through proactive monitoring, automation, and optimization
Preferred Qualifications
- Experience with containerization and orchestration tools (e.g., Docker, Kubernetes)
- Exposure to or experience working with compute platforms (e.g., OpenStack, AWS)
- Familiarity with CI/CD pipelines and automation workflows, and the ability to contribute to them
Benefits
- Paid time off
- Retirement savings (e.g., 401(k), pension schemes)
- Bonus/incentive eligibility
- Equity grants
- Participation in our employee stock purchase plan
- Competitive health benefits
- Other family-friendly benefits including parental leave