Manager of Engineering, SRE
Platform Science
Job highlights
Summary
Join Platform Science as a Site Reliability Engineering (SRE) Manager and lead a high-performing team ensuring system reliability, scalability, and efficiency. You will coach the team, promote best practices, and enable development teams to deliver production-ready applications. This role involves overseeing multiple projects and initiatives while maintaining clear communication. The ideal candidate has 5+ years of software engineering or SRE experience, including 2+ years in a leadership position, and proven expertise with Kubernetes, AWS, and common observability and CI/CD tooling. Platform Science offers a comprehensive benefits package including medical, dental, vision, disability, and life insurance, a 401k plan, paid time off, and parental leave. The estimated base salary is between $134,550 and $200,000.
Requirements
- 5+ years of experience in software engineering or SRE roles
- 2+ years in a leadership or management position
- Proven expertise with Kubernetes, ArgoCD, AWS, Prometheus, Grafana, Datadog, FluentD, Jenkins, and Docker
- Strong knowledge of CI/CD and GitOps practices
- Excellent verbal and written communication skills
- Demonstrated ability to track and prioritize multiple projects, requests, and initiatives effectively
- Bachelor's degree in Computer Science, Engineering, or equivalent experience
Responsibilities
- Recruit, train, and mentor a team of Site Reliability Engineers to deliver operational excellence
- Foster a culture of innovation, collaboration, and adherence to SRE principles like SLOs, error budgets, and production readiness
- Standardize and train development teams on observability tools such as Prometheus, Grafana, and Datadog
- Enhance developer and release workflows using CI/CD best practices, GitOps methodologies, and tools like Jenkins, ArgoCD, and Docker
- Drive application and system resilience through chaos engineering, load testing, and automation
- Collaborate with teams to define SLIs and SLOs and to manage error budgets
- Manage on-call rotation schedules, optimize alerting processes, and ensure 24/7 production application support
- Serve as the escalation point for incident resolution, providing guidance and technical expertise
- Build tools, dashboards, and processes to improve incident response, production health, and system reliability
- Conduct quarterly "State of the Service" reviews to assess performance, sustainability, and risks
- Track and prioritize multiple initiatives while ensuring the team stays focused and aligned with organizational goals
- Maintain detailed documentation on team projects, requests, policies, and best practices
- Communicate effectively across teams, departments, and stakeholders to ensure alignment and a clear understanding of SRE initiatives
- Evangelize SRE practices across the organization and ensure consistent adoption of reliability-focused processes
Benefits
- Medical, dental, and vision insurance
- Short-term and long-term disability insurance
- AD&D and life insurance
- 401k plan
- Paid vacation, sick leave, and holidays
- Six weeks of paid parental leave