Staff Site Reliability Engineer

Primer AI

💵 $180k-$230k
📍 Remote - United States

Summary

Join Primer, a company dedicated to making the world safer through AI, as a Staff Site Reliability Engineer. You will be a key member of the Infrastructure team, responsible for designing, building, and maintaining fault-tolerant systems. You will collaborate with other teams to define and meet service level objectives, implement observability, and improve engineering practices, drawing on your expertise in automation, incident management, and observability. The role requires significant experience in production systems engineering, Linux administration, observability tooling, Kubernetes, and CI/CD pipelines. Primer offers competitive compensation and a comprehensive benefits package.

Requirements

  • 10+ years of experience in production systems engineering, SRE, or DevOps roles supporting large-scale, mission-critical platforms
  • 10+ years of experience with Linux systems administration and Bash/Linux scripting
  • 5+ years of experience with observability tools (monitoring, logging, tracing) such as Datadog, New Relic, Prometheus, ELK, or similar
  • 5+ years of experience with microservices architectures, Kubernetes, and CI/CD pipelines
  • 2+ years of experience in at least one programming language (e.g., Python, Go) with a strong focus on building automation and tooling
  • Solid understanding of cloud networking (e.g., mesh networking, TCP/IP, DNS, load balancing, VPNs)

Responsibilities

  • Architect, Build, and Scale: Design and build our production systems for continuous availability and scalability
  • Uphold Reliability Standards: Define and review Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets. Work with engineering teams to ensure new services or features meet reliability and performance targets
  • Drive Automation & Tooling: Develop tools, frameworks, and platforms to streamline repetitive tasks (e.g., monitoring, incident response). Write software that improves reliability and security (e.g., automated testing, canary deployments)
  • Incident Management & Postmortems: Participate in on-call (Livesite) rotations; lead and coordinate incident responses. Conduct thorough post-incident reviews, share learnings, and implement improvements to mitigate future occurrences
  • Observability Best Practices: Develop and maintain best-in-class monitoring, logging, and alerting systems to provide actionable insights into the health of infrastructure and services. Advise teams on instrumentation best practices, ensuring comprehensive coverage of critical paths and dependencies
  • Cross-Functional Collaboration: Work closely with product managers, software engineers, and security teams to deliver end-to-end solutions with reliability built in

Preferred Qualifications

  • Experience building or running distributed systems that include GPU-heavy workloads or LLMs
  • Strong knowledge of the AWS platform with experience in cost optimization and capacity planning
  • Track record of leading incident response efforts and conducting detailed postmortems
  • Security awareness and familiarity with secure coding, encryption, and compliance best practices
  • Excellent communication skills, with the ability to explain complex topics to both technical and non-technical audiences

Benefits

  • Full medical, dental, and vision coverage
  • Fertility benefits through Carrot
  • Mental health coverage on demand with Headspace Care+
  • Gympass+ Membership via Wellhub
  • One Medical Membership
  • 401(k)
  • Remote work stipends
  • Monthly internet allowance
  • Flexible vacation policy
  • Wellness Days
  • 100% paid leave for parents of growing families
