Staff Site Reliability Engineer
Gemini
Job highlights
Summary
Join Gemini's Platform team as a Staff Site Reliability Engineer and play a key role in leading engineering teams toward modern DevOps practices. You will provide operational support and engineering for Gemini services, improve reliability and time-to-market, guide engineering teams on platform services, and run performance evaluations. Responsibilities include making architecture recommendations, creating production-ready scorecards, implementing monitoring best practices, defining SLIs and SLOs, and educating teams on reliability best practices. This role requires extensive experience with monitoring, alerting, automation, cloud technologies, containerization, configuration management, scripting, and performance analysis. Gemini offers a compensation and benefits package that includes a competitive salary, annual bonus, equity grant, comprehensive health plans, 401k matching, paid parental leave, and flexible time off.
Requirements
- 7+ years using monitoring, alerting, and automation tooling to understand and remediate performance and health issues in systems at scale
- Good knowledge of cloud providers such as AWS, GCP, or Azure
- Experience in a code-first environment, developing automated solutions to solve support and operational issues
- Experience as a Technical Leader within a team, helping evaluate and make technical decisions for the team
- Experience working with containerization and orchestration tools such as Docker, Nomad, EKS (Kubernetes), etc.
- Experience working with Configuration Management tools such as Ansible, Chef, or Puppet
- Experience writing scripts or CLI tools that help increase Developer Productivity in high-level languages like Python, Go, etc.
- Experience analyzing system and application performance, identifying bottlenecks, and recommending architectural or systemic improvements
- Experience working with Engineering teams, teaching, training, and mentoring on how to implement best-practice technical solutions
- Experience working in a code-driven, automation-first public cloud infrastructure (Terraform)
Responsibilities
- Provide primary operational support and engineering for various Gemini services
- Improve reliability, quality and time-to-market across all Gemini services and offerings
- Guide engineering teams onto the various supported services provided by Platform
- Run ongoing performance evaluations and improvements for Gemini systems
- Provide architecture recommendations and engagement as part of the SDLC
- Create “Production-ready Scorecards” to evaluate the health of systems pre-launch
- Implement and teach monitoring, alerting, and automated resolution best practices
- Define SLIs and SLOs with Engineering teams
- Educate and guide Engineering teams on reliability and resiliency best practices, such as statelessness, chaos testing, and blue/green deployments
- Build operational tooling and automations
Benefits
- Competitive starting salary
- A discretionary annual bonus
- Long-term incentive in the form of a new hire equity grant
- Comprehensive health plans
- 401K with company matching
- Paid Parental Leave
- Flexible time off