Platform Science is hiring a
Senior Site Reliability Engineer, Remote - United States

Senior Site Reliability Engineer

🏢 Platform Science

💵 $109k-$176k
📍United States

Summary

Platform Science is seeking a Senior Site Reliability Engineer (SRE), based in San Diego, CA or remote within the United States, to solve operational problems and support the development teams behind business-critical applications. The role is responsible for the reliability of production services and for enabling development teams to measure the reliability of their own services effectively. The SRE team works across a range of technologies and products.

Requirements

  • Possess 5+ years of hands-on experience in SRE or Platform Engineering roles
  • Demonstrated expertise (2+ years) with automation technologies like Jenkins, ArgoCD, or similar
  • Experience with Kubernetes (2+ years), Helm, and Docker within production environments
  • Proficiency with current software development lifecycle (SDLC) concepts and best practices, CI/CD pipelines, and test-driven development
  • Experience with AWS, encompassing proficiency in EKS, IAM, autoscaling, networking, and load balancing/request routing in a production environment
  • Proficient in Python, Bash, Node.js, and/or Go
  • Proficient with distributed tracing methodologies and observability tools such as Prometheus, ELK, or Datadog
  • Strong emphasis on documentation and fostering knowledge-sharing practices within the team and organization
  • Track record of successfully training and mentoring engineers
  • Proven expertise in optimizing performance and managing costs within cloud environments
  • Sound understanding of SLI/SLO concepts and adherence to SRE best practices
  • Bachelor's degree in Computer Science or a related field

Responsibilities

  • Develop and enhance Continuous Integration/Continuous Deployment (CI/CD) pipelines
  • Refine release management processes and associated toolsets
  • Maintain Helm charts to streamline application deployment and management
  • Establish standardized observability solutions to empower development teams in efficiently managing their applications
  • Lead the effort in promoting and prioritizing reliability, driving achievement of uptime goals, and mentoring colleagues in SRE best practices
  • Conduct comprehensive Production Readiness Reviews, working with teams to identify and establish Service Level Indicators and Service Level Objectives (SLIs/SLOs), and ensure high-quality and dependable services
  • Design and develop software that addresses operational challenges and improves system stability and reliability
  • Fulfill on-call duties, providing expert support to development teams for mission-critical applications in production environments
  • Improve the resiliency of applications and systems using chaos engineering

Benefits

  • Medical, dental, and vision insurance
  • Short-term and long-term disability insurance
  • AD&D and life insurance
  • 401(k) plan
  • Paid vacation, sick leave, and holidays
  • Six weeks of paid parental leave
