Site Reliability Engineer, Platform at Gemini

Summary

Join Gemini's Platform team as a Staff Site Reliability Engineer and play a key role in leading engineering teams toward modern DevOps practices. You will provide operational support and engineering for Gemini services, improve reliability and time-to-market, guide engineering teams in adopting platform services, and run performance evaluations of Gemini systems. Responsibilities include making architecture recommendations, creating production-ready scorecards, implementing monitoring and alerting best practices, defining SLIs and SLOs, and educating teams on reliability best practices. You will also build operational tooling and automations. This role follows a hybrid work model, with in-person presence twice a week in either Seattle, WA or New York City, NY.

Requirements

7+ years using monitoring, alerting, and automation tooling to understand and remediate performance and health issues in systems at scale
Good knowledge of cloud providers such as AWS, GCP, or Azure
Experience in a code-first environment, developing automated solutions to solve support and operational issues
Experience as a technical leader within a team, helping evaluate and make technical decisions for the team
Experience working with containerization and orchestration tooling such as Nomad, EKS (Kubernetes), or Docker
Experience working with configuration management tools such as Ansible, Chef, or Puppet
Experience writing scripts or CLI tools that increase developer productivity in high-level languages such as Python or Go
Experience analyzing system and application performance, identifying bottlenecks, and recommending architectural or systemic improvements
Experience working with Engineering teams, teaching, training, and mentoring on how to implement best-practice technical solutions
Experience working in a code-driven, automation-first public cloud infrastructure (Terraform)

Responsibilities

Provide primary operational support and engineering for various Gemini services
Improve reliability, quality and time-to-market across all Gemini services and offerings
Guide engineering teams in adopting the services supported by Platform
Run ongoing performance evaluations and improvements for Gemini systems
Provide architecture recommendations and engagement as part of the SDLC
Create “Production-ready Scorecards” to evaluate the health of systems pre-launch
Implement and teach monitoring, alerting, and automated resolution best practices
Define SLIs and SLOs with Engineering teams
Educate and guide Engineering teams on reliability and resiliency best practices, such as statelessness, chaos testing, and blue/green deployments
Build operational tooling and automations

Benefits

Competitive starting salary
A discretionary annual bonus
Long-term incentive in the form of a new hire equity grant
Comprehensive health plans
401(k) with company matching
Paid parental leave
Flexible time off
