Gemini is hiring a
Manager, Site Reliability Engineering

Gemini

💵 $172k-$215k
📍 Remote - United States

Summary

Join our team as Manager, Site Reliability Engineering and lead a team of skilled Site Reliability Engineers responsible for the design, deployment, and maintenance of our production systems. You will play a crucial role in ensuring the reliability, scalability, and performance of our infrastructure, as well as driving continuous improvement initiatives.

Requirements

  • Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent practical experience)
  • Proven experience as a Site Reliability Engineer or in a similar role, with 3-5 years of hands-on experience managing production systems
  • Strong expertise in the following technologies: Ansible, Concourse CI, Jenkins, GitHub Actions, EKS (Kubernetes), Linux administration, Terraform
  • Demonstrated experience in leading and managing a team of technical professionals for at least 2 years
  • Solid understanding of SRE principles, including reliability, scalability, availability, and performance
  • Proficient in scripting and automation (e.g., Python, Bash, or similar)
  • Experience with infrastructure-as-code (IaC) tools, configuration management, and CI/CD pipelines
  • Knowledge of cloud platforms (e.g., AWS, Azure, or Google Cloud) and containerization technologies (e.g., Docker)
  • Excellent problem-solving skills and the ability to thrive in a fast-paced, dynamic environment
  • Strong communication and leadership skills, with the ability to collaborate effectively with both technical and non-technical stakeholders

Responsibilities

  • Lead, mentor, and manage a team of Site Reliability Engineers, fostering a culture of collaboration, innovation, and operational excellence
  • Develop, communicate, and execute the SRE team's strategic goals, objectives, and roadmap in alignment with the overall business objectives
  • Oversee the design, implementation, and maintenance of highly available and scalable production systems
  • Drive continuous improvement initiatives by identifying areas for enhancement and implementing best practices, automation, and process improvements
  • Collaborate with cross-functional teams and departments to ensure smooth integration of applications and systems
  • Define and enforce Service Level Objectives (SLOs) and Service Level Agreements (SLAs) to ensure system reliability and uptime
  • Monitor system performance, troubleshoot issues, and ensure timely incident response, root cause analysis, and problem resolution
  • Implement effective monitoring, logging, and alerting systems to proactively identify and mitigate potential issues
  • Stay up-to-date with industry trends, emerging technologies, and best practices related to SRE and DevOps, and apply them to improve operational efficiency
  • Identify potential risks to system reliability and implement strategies to mitigate them
  • Ensure that all systems and processes comply with relevant regulations, standards, and best practices

Preferred Qualifications

  • Relevant certifications, such as Certified Kubernetes Administrator (CKA) or AWS Certified DevOps Engineer
  • Experience with monitoring and observability tools (e.g., Datadog, New Relic, Prometheus, Grafana, ELK Stack)

Benefits

  • Competitive starting salary
  • A discretionary annual bonus
  • Long-term incentive in the form of a new hire equity grant
  • Comprehensive health plans
  • 401(k) with company matching
  • Paid parental leave
  • Flexible time off

Please let Gemini know you found this job on JobsCollider. Thanks! 🙏