Platform Science is hiring a
Staff Site Reliability Engineer

closed

Platform Science

💵 $145k-$227k
📍 Remote - United States

Summary

The role is for a Staff SRE, based in San Diego or remote, who will solve operational problems, support development teams, and ensure the reliability of production services. It involves leading the development of CI/CD pipelines, architecting Helm charts, establishing observability solutions, conducting Production Readiness Reviews, designing software solutions to address operational challenges, fulfilling on-call duties, and improving resiliency through chaos engineering.

Requirements

  • Possess 9+ years of hands-on experience in SRE or Platform Engineering roles
  • Demonstrated expertise (4+ years) with automation technologies like Jenkins, ArgoCD, or similar
  • Extensive (3+ years) experience with Kubernetes, Helm, and Docker within production environments
  • Proficiency with current software development lifecycle (SDLC) concepts and best practices, CI/CD pipelines, and test-driven development
  • Experience with AWS, encompassing proficiency in EKS, IAM, autoscaling, networking, and load balancing/request routing in a production environment
  • Proficient in Python, Bash, Node.js, and/or Go
  • Proficient with distributed tracing methodologies and observability tools such as Prometheus, ELK, or Datadog
  • Strong emphasis on documentation and fostering knowledge-sharing practices within the team and organization
  • Track record of successfully training and mentoring engineers
  • Proven expertise in optimizing performance and managing costs within cloud environments
  • Sound understanding of SLI/SLO concepts and adherence to SRE best practices

Responsibilities

  • Lead the development and enhancement of Continuous Integration/Continuous Deployment (CI/CD) pipelines
  • Architect and maintain Helm charts to streamline application deployment and management
  • Establish standardized observability solutions to empower development teams in efficiently managing their applications
  • Lead the effort in promoting and prioritizing reliability, driving achievement of uptime goals and mentoring colleagues in SRE best practices
  • Conduct comprehensive Production Readiness Reviews, working with teams to identify and establish Service Level Objectives (SLOs), and ensure high-quality and dependable services
  • Design and develop software solutions that address operational challenges and improve system stability and reliability
  • Fulfill on-call duties, providing expert support to development teams for mission-critical applications in production environments
  • Improve the resiliency of applications and systems using chaos engineering

Benefits

  • Medical, dental, and vision insurance
  • Short-term and long-term disability insurance
  • AD&D and life insurance
  • 401(k) plan
  • Paid vacation, sick leave and holidays
  • Six weeks of paid parental leave