Site Reliability Engineer

StarCompliance
Summary
Join StarCompliance as a Site Reliability Engineer (SRE) and play a pivotal role in modernizing our platform. You will lead the evolution from legacy systems to modern, scalable microservices, focusing on application-level observability, autoscaling, and progressive delivery. Collaborate with cross-functional teams to design, build, and implement next-generation SRE practices and tools. This role offers the opportunity to make a significant impact on our platform's reliability and scalability as we grow to support thousands of customers and millions of end users. You will champion reliability by design, lead observability overhauls, and develop autoscaling strategies. This is a foundational role in a company-wide modernization initiative.
Requirements
- 5+ years in SRE, DevOps, or Production Engineering roles, ideally within a SaaS or cloud-native environment
- Deep experience with cloud platforms (preferably Azure or AWS) and Infrastructure-as-Code tools (e.g., Terraform)
- Proficiency with observability tools such as New Relic, Datadog, Prometheus, or similar
- Strong understanding of software deployment strategies, CI/CD pipelines, and release engineering
- Ability to code in at least one modern scripting or systems language (e.g., Python, PowerShell, Go, Bash)
- Experience operating multi-tenant environments with an emphasis on security, performance, and cost optimization
- Excellent communicator who thrives in cross-functional settings and can influence engineering culture around reliability
Responsibilities
- Champion Reliability by Design: Collaborate with architects and engineers to build resilient, fault-tolerant systems across our evolving cloud-native stack
- Observability Overhaul: Lead the charge on full-stack observability, leveraging modern APM tooling, meaningful SLOs/SLIs, and actionable alerts
- Scaling Systems: Develop and implement auto-scaling strategies, load testing plans, and capacity forecasting for multi-tenant environments
- Progressive Delivery: Help implement and automate deployment strategies such as canary releases, feature flags, and blue/green rollouts
- Incident Response: Create and refine on-call processes, incident response playbooks, and blameless post-mortem routines
- Monitoring & Tooling: Own and evolve our monitoring infrastructure, integrating metrics, logs, and traces into a cohesive ecosystem
- Developer Empowerment: Build reusable templates, dashboards, and platform tooling to empower dev teams to "shift left" on reliability
- Cross-functional Collaboration: Work hand-in-hand with Infrastructure, Architecture, Support, and Engineering teams to drive shared accountability for uptime and performance
Preferred Qualifications
- Hands-on experience with Azure DevOps is strongly preferred, as our CI/CD and project workflows are fully built around it
- Experience in regulated industries (e.g., financial services, healthcare)
- Background with service mesh architectures, distributed tracing, and gRPC/GraphQL
- Familiarity with incident management platforms (e.g., PagerDuty, OpsGenie)
- Contributions to open-source SRE tooling or frameworks