Site Reliability Engineer at WorkOS

Summary

Join WorkOS's Site Reliability Engineering (SRE) team and help keep the platform fast, reliable, and resilient. As an early team member, you will shape the company's approach to reliability at scale. You will design and evolve the systems, tooling, and processes that improve reliability and performance, and work with product and infrastructure teams to ensure services are production-ready and resilient. You will also define and measure SLIs/SLOs to guide improvements, write and optimize backend systems in TypeScript, improve incident response processes, develop internal tools and automations, participate in the on-call rotation, and contribute to design and architecture discussions. The role requires experience operating and scaling production systems in cloud environments and familiarity with service reliability concepts.

Requirements

Experience operating and scaling production systems in cloud environments (we use AWS)
Familiarity with service reliability concepts—monitoring, alerting, incident response, and root cause analysis
Comfort working across infrastructure layers (e.g. compute, networking, storage, observability tooling)
Strong debugging and systems thinking skills—you can follow problems across services and layers
Ability to work independently, take ownership, and drive projects from problem discovery through resolution

Responsibilities

Design and evolve the systems, tooling, and processes that improve the reliability and performance of WorkOS
Collaborate with product and infrastructure teams to ensure services are production-ready, observable, and resilient to failure
Define and measure SLIs/SLOs to guide reliability improvements
Write and optimize backend systems (in TypeScript) with a focus on performance, maintainability, and graceful degradation
Improve our incident response process, lead postmortems, and drive follow-through on reliability risks
Develop internal tools and automations that make it easier to operate and scale our systems
Participate in our on-call rotation—responding to, resolving, and learning from production incidents
Contribute to design and architecture discussions with a focus on operability and long-term sustainability
Document systems, share learnings, and help grow a reliability-minded engineering culture

Preferred Qualifications

Familiarity with Kubernetes or similar orchestration systems
Exposure to observability stacks (e.g. Prometheus, Grafana, Datadog, OpenTelemetry)
Exposure to TypeScript or interest in working in a TypeScript-based codebase

Benefits

Competitive pay
Substantial equity grants
Health insurance (medical, dental, and vision) for you and your family
401(k) matching
Wellness and fitness monthly allowances
PTO + paid holidays + unlimited sick leave
Autonomy and flexibility with remote work

Remote

DevOps

Mid-level
