Summary

Join our team as Manager, Site Reliability Engineering and lead a team of skilled Site Reliability Engineers responsible for the design, deployment, and maintenance of our production systems. You will play a crucial role in ensuring the reliability, scalability, and performance of our infrastructure, as well as driving continuous improvement initiatives.

Requirements

Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent practical experience)
Proven experience as a Site Reliability Engineer or in a similar role, with 3-5 years of hands-on experience managing production systems
Strong expertise in the listed technologies: Ansible, Concourse CI, Jenkins, GitHub Actions, EKS (Kubernetes), Linux administration, Terraform
Demonstrated experience in leading and managing a team of technical professionals for at least 2 years
Solid understanding of SRE principles, including reliability, scalability, availability, and performance
Proficient in scripting and automation (e.g., Python, Bash, or similar)
Experience with infrastructure-as-code (IaC) tools, configuration management, and CI/CD pipelines
Knowledge of cloud platforms (e.g., AWS, Azure, or Google Cloud) and containerization technologies (e.g., Docker)
Excellent problem-solving skills and the ability to thrive in a fast-paced, dynamic environment
Strong communication and leadership skills, with the ability to collaborate effectively with both technical and non-technical stakeholders

Responsibilities

Lead, mentor, and manage a team of Site Reliability Engineers, fostering a culture of collaboration, innovation, and operational excellence
Develop, communicate, and execute the SRE team's strategic goals, objectives, and roadmap in alignment with the overall business objectives
Oversee the design, implementation, and maintenance of highly available and scalable production systems
Drive continuous improvement initiatives by identifying areas for enhancement and implementing best practices, automation, and process improvements
Collaborate with cross-functional teams and departments to ensure smooth integration of applications and systems
Define and enforce Service Level Objectives (SLOs) and Service Level Agreements (SLAs) to ensure system reliability and uptime
Monitor system performance, troubleshoot issues, and ensure timely incident response, root cause analysis, and problem resolution
Implement effective monitoring, logging, and alerting systems to proactively identify and mitigate potential issues
Stay up to date with industry trends, emerging technologies, and best practices related to SRE and DevOps, and apply them to improve operational efficiency
Identify potential risks to system reliability and implement strategies to mitigate them
Ensure that all systems and processes comply with relevant regulations, standards, and best practices

Preferred Qualifications

Relevant certifications, such as Certified Kubernetes Administrator (CKA) or AWS Certified DevOps Engineer
Experience with monitoring and observability tools (e.g., Datadog, New Relic, Prometheus, Grafana, ELK Stack)

Benefits

Competitive starting salary
A discretionary annual bonus
Long-term incentive in the form of a new hire equity grant
Comprehensive health plans
401(k) with company matching
Paid parental leave
Flexible time off

Remote Manager, Site Reliability Engineering

Gemini
