Remote Staff Site Reliability Engineer at Platform Science

Summary

This is a Staff Site Reliability Engineer role based in San Diego or remote. The engineer solves operational problems, supports development teams, and ensures the reliability of production services. The role involves leading CI/CD pipeline development, architecting Helm charts, establishing observability solutions, conducting Production Readiness Reviews, designing software to address operational challenges, fulfilling on-call duties, and improving resiliency through chaos engineering.

Requirements

Possess 9+ years of hands-on experience in SRE or Platform Engineering roles
Demonstrated expertise (4+ years) with automation technologies like Jenkins, ArgoCD, or similar
Extensive (3+ years) experience with Kubernetes, Helm, and Docker within production environments
Proficiency with current software development lifecycle (SDLC) concepts and best practices, CI/CD pipelines, and test-driven development
Experience with AWS, encompassing proficiency in EKS, IAM, autoscaling, networking, and load balancing/request routing in a production environment
Proficient in Python, Bash, Node.js, and/or Go
Proficient with distributed tracing methodologies and observability tools such as Prometheus, ELK, or Datadog
Strong emphasis on documentation and fostering knowledge-sharing practices within the team and organization
Track record of successfully training and mentoring engineers
Proven expertise in optimizing performance and managing costs within cloud environments
Sound understanding of SLI/SLO concepts and adherence to SRE best practices

Responsibilities

Lead the development and enhancement of Continuous Integration/Continuous Deployment (CI/CD) pipelines
Architect and maintain Helm charts to streamline application deployment and management
Establish standardized observability solutions to empower development teams in efficiently managing their applications
Lead the effort in promoting and prioritizing reliability, driving achievement of uptime goals and mentoring colleagues in SRE best practices
Conduct comprehensive Production Readiness Reviews, working with teams to identify and establish Service Level Objectives (SLOs), and ensure high-quality and dependable services
Design and develop software solutions that address operational challenges and improve system stability and reliability
Fulfill on-call duties, providing expert support to development teams for mission-critical applications in production environments
Improve the resiliency of applications and systems using chaos engineering

Benefits

Medical, dental, and vision insurance
Short-term and long-term disability insurance
AD&D and life insurance
401k plan
Paid vacation, sick leave and holidays
Six weeks of paid parental leave

Remote

DevOps

Mid-level

Similar Jobs

Staff Site Reliability Engineer

Gemini

Remote

DevOps

Mid-level

Staff Site Reliability Engineer

Earnin

Remote

DevOps

Senior

Senior or Staff Site Reliability Engineer

Circle

Remote

DevOps

Senior
