Manager of Engineering, SRE

Platform Science

💵 $134k-$200k
📍 Remote - United States

Job highlights

Summary

Join Platform Science as a Site Reliability Engineering (SRE) Manager and lead a high-performing team responsible for system reliability, scalability, and efficiency. You will coach the team, promote best practices, and enable development teams to deliver production-ready applications. The role involves overseeing multiple projects and initiatives while maintaining clear communication. The ideal candidate has 5+ years of software engineering or SRE experience, including 2+ years in a leadership position, and proven expertise with Kubernetes, AWS, and CI/CD and observability tooling. Platform Science offers a comprehensive benefits package including medical, dental, vision, disability, and life insurance, a 401k plan, paid time off, and parental leave. The estimated base salary is between $134,550 and $200,000.

Requirements

  • 5+ years of experience in software engineering or SRE roles
  • 2+ years in a leadership or management position
  • Proven expertise with Kubernetes, ArgoCD, AWS, Prometheus, Grafana, Datadog, FluentD, Jenkins, and Docker
  • Strong knowledge of CI/CD and GitOps practices
  • Excellent verbal and written communication skills
  • Demonstrated ability to track and prioritize multiple projects, requests, and initiatives effectively
  • Bachelor's degree in Computer Science, Engineering, or equivalent experience

Responsibilities

  • Recruit, train, and mentor a team of Site Reliability Engineers to deliver operational excellence
  • Foster a culture of innovation, collaboration, and adherence to SRE principles like SLOs, error budgets, and production readiness
  • Standardize and train development teams on observability tools such as Prometheus, Grafana, and Datadog
  • Enhance developer and release workflows using CI/CD best practices, GitOps methodologies, and tools like Jenkins, ArgoCD, and Docker
  • Drive application and system resilience through chaos engineering, load testing, and automation
  • Collaborate with teams to define SLIs and SLOs and to manage error budgets
  • Manage on-call rotation schedules, optimize alerting processes, and ensure 24/7 production application support
  • Serve as the escalation point for incident resolution, providing guidance and technical expertise
  • Build tools, dashboards, and processes to improve incident response, production health, and system reliability
  • Conduct quarterly "State of the Service" reviews to assess performance, sustainability, and risks
  • Track and prioritize multiple initiatives while ensuring the team stays focused and aligned with organizational goals
  • Maintain detailed documentation on team projects, requests, policies, and best practices
  • Communicate effectively across teams, departments, and stakeholders to ensure alignment and a clear understanding of SRE initiatives
  • Evangelize SRE practices across the organization and ensure consistent adoption of reliability-focused processes

Benefits

  • Medical, dental, and vision insurance
  • Short-term and long-term disability insurance
  • AD&D and life insurance
  • 401k plan
  • Paid vacation, sick leave, and holidays
  • Six weeks of paid parental leave
