Platform Science is hiring a
Staff Site Reliability Engineer
Platform Science (closed)
Salary: $145k-$227k
Location: Remote - United States
Summary
This is a Staff SRE role based in San Diego (or remote), focused on solving operational problems, supporting development teams, and ensuring the reliability of production services. The role involves leading the development of CI/CD pipelines, architecting Helm charts, establishing observability solutions, conducting Production Readiness Reviews, designing software solutions to address operational challenges, fulfilling on-call duties, and improving resiliency through chaos engineering.
Requirements
- Possess 9+ years of hands-on experience in SRE or Platform Engineering roles
- Demonstrated expertise (4+ years) with automation technologies like Jenkins, ArgoCD, or similar
- Extensive (3+ years) experience with Kubernetes, Helm, and Docker within production environments
- Proficiency with current software development lifecycle (SDLC) concepts and best practices, CI/CD pipelines, and test-driven development
- Experience with AWS, encompassing proficiency in EKS, IAM, autoscaling, networking, and load balancing/request routing in a production environment
- Proficient in Python, Bash, Node.js, and/or Go
- Proficient with distributed tracing methodologies and observability tools such as Prometheus, ELK, or Datadog
- Strong emphasis on documentation and fostering knowledge-sharing practices within the team and organization
- Track record of successfully training and mentoring engineers
- Proven expertise in optimizing performance and managing costs within cloud environments
- Sound understanding of SLI/SLO concepts and adherence to SRE best practices
Responsibilities
- Lead the development and enhancement of Continuous Integration/Continuous Deployment (CI/CD) pipelines
- Architect and maintain Helm charts to streamline application deployment and management
- Establish standardized observability solutions to empower development teams in efficiently managing their applications
- Lead the effort in promoting and prioritizing reliability, driving achievement of uptime goals and mentoring colleagues in SRE best practices
- Conduct comprehensive Production Readiness Reviews, working with teams to identify and establish Service Level Objectives (SLOs), and ensure high-quality and dependable services
- Design and develop software solutions to address operational challenges effectively to improve system stability and reliability
- Fulfill on-call duties, providing expert support to development teams for mission-critical applications in production environments
- Improve the resiliency of applications and systems using chaos engineering
Benefits
- Medical, dental, and vision insurance
- Short-term and long-term disability insurance
- AD&D and life insurance
- 401(k) plan
- Paid vacation, sick leave and holidays
- Six weeks of paid parental leave
This job is filled or no longer available