Site Reliability Engineer

WorkOS
Summary
Join WorkOS's Site Reliability Engineering (SRE) team and help keep the platform fast, reliable, and resilient. As an early team member, you will shape the approach to reliability at scale and collaborate across the company. You will design and evolve systems, tooling, and processes that improve reliability and performance, and work with product and infrastructure teams to ensure services are production-ready and resilient. You will also define and measure SLIs/SLOs to guide improvements, write and optimize backend systems in TypeScript, improve incident response processes, develop internal tools and automations, participate in the on-call rotation, and contribute to design and architecture discussions. This role requires experience operating and scaling production systems in cloud environments and familiarity with service reliability concepts.
Requirements
- Experience operating and scaling production systems in cloud environments (we use AWS)
- Familiarity with service reliability concepts—monitoring, alerting, incident response, and root cause analysis
- Comfort working across infrastructure layers (e.g. compute, networking, storage, observability tooling)
- Strong debugging and systems thinking skills—you can follow problems across services and layers
- Ability to work independently, take ownership, and drive projects from problem discovery through resolution
Responsibilities
- Design and evolve the systems, tooling, and processes that improve the reliability and performance of WorkOS
- Collaborate with product and infrastructure teams to ensure services are production-ready, observable, and resilient to failure
- Define and measure SLIs/SLOs to guide reliability improvements
- Write and optimize backend systems (in TypeScript) with a focus on performance, maintainability, and graceful degradation
- Improve our incident response process, lead postmortems, and drive follow-through on reliability risks
- Develop internal tools and automations that make it easier to operate and scale our systems
- Participate in our on-call rotation—responding to, resolving, and learning from production incidents
- Contribute to design and architecture discussions with a focus on operability and long-term sustainability
- Document systems, share learnings, and help grow a reliability-minded engineering culture
Preferred Qualifications
- Familiarity with Kubernetes or similar orchestration systems
- Exposure to observability stacks (e.g. Prometheus, Grafana, Datadog, OpenTelemetry)
- Exposure to TypeScript or interest in working in a TypeScript-based codebase
Benefits
- Competitive pay
- Substantial equity grants
- Health insurance (medical, dental, and vision) for you and your family
- 401(k) matching
- Wellness and fitness monthly allowances
- PTO + paid holidays + unlimited sick leave
- Autonomy and flexibility with remote work