Manager of Engineering, SRE

closed

Platform Science

💵 $134k-$200k
📍 Remote - United States

Summary

Join Platform Science as a Site Reliability Engineering (SRE) Manager and lead a high-performing team ensuring system reliability, scalability, and efficiency. You will coach the team, promote best practices, and enable development teams to deliver production-ready applications. This role involves overseeing multiple projects and initiatives while maintaining clear communication. The ideal candidate has 5+ years of software engineering or SRE experience, including 2+ years in a leadership position, and proven expertise with Kubernetes, AWS, and CI/CD and observability tooling. Platform Science offers a comprehensive benefits package including medical, dental, vision, disability, and life insurance, a 401k plan, paid time off, and parental leave. The estimated base salary is between $134,550 and $200,000.

Requirements

  • 5+ years of experience in software engineering or SRE roles
  • 2+ years in a leadership or management position
  • Proven expertise with Kubernetes, ArgoCD, AWS, Prometheus, Grafana, Datadog, FluentD, Jenkins, and Docker
  • Strong knowledge of CI/CD and GitOps practices
  • Excellent verbal and written communication skills
  • Demonstrated ability to track and prioritize multiple projects, requests, and initiatives effectively
  • Bachelor’s degree in Computer Science, Engineering, or equivalent experience

Responsibilities

  • Recruit, train, and mentor a team of Site Reliability Engineers to deliver operational excellence
  • Foster a culture of innovation, collaboration, and adherence to SRE principles like SLOs, error budgets, and production readiness
  • Standardize and train development teams on observability tools such as Prometheus, Grafana, and Datadog
  • Enhance developer and release workflows using CI/CD best practices, GitOps methodologies, and tools like Jenkins, ArgoCD, and Docker
  • Drive application and system resilience through chaos engineering, load testing, and automation
  • Collaborate with teams to define SLIs, SLOs, and manage error budgets
  • Manage on-call rotation schedules, optimize alerting processes, and ensure 24/7 production application support
  • Serve as the escalation point for incident resolution, providing guidance and technical expertise
  • Build tools, dashboards, and processes to improve incident response, production health, and system reliability
  • Conduct quarterly "State of the Service" reviews to assess performance, sustainability, and risks
  • Track and prioritize multiple initiatives while ensuring the team stays focused and aligned with organizational goals
  • Maintain detailed documentation on team projects, requests, policies, and best practices
  • Communicate effectively across teams, departments, and stakeholders to ensure alignment and a clear understanding of SRE initiatives
  • Evangelize SRE practices across the organization and ensure consistent adoption of reliability-focused processes

Benefits

  • Medical, dental, and vision insurance
  • Short-term and long-term disability insurance
  • AD&D and life insurance
  • 401k plan
  • Paid vacation, sick leave, and holidays
  • Six weeks of paid parental leave

This job is filled or no longer available.