Staff/Senior Staff Site Reliability Engineer
closed
SandboxAQ
Summary
Join SandboxAQ as a Senior Staff Site Reliability Engineer to maintain and improve the reliability, performance, and scalability of our infrastructure and services. Collaborate with engineering teams to ensure system resilience and high availability. Lead incident response, root cause analysis, and capacity planning. Design and maintain monitoring and alerting solutions. Optimize infrastructure costs and build automation tools. Mentor junior engineers and participate in the on-call rotation. This role requires 10+ years of experience in SRE, DevOps, or similar roles, strong cloud platform experience, and expertise in systems administration, networking, and security. The salary range is $183k-$280k per year, plus potential bonuses and equity.
Requirements
- 10+ years of experience in Site Reliability Engineering, DevOps, or similar roles
- Strong experience with cloud platforms (AWS, GCP, or Azure), containerization (Docker, Kubernetes), and infrastructure-as-code (Terraform, CloudFormation)
- Proven ability to lead post-incident reviews and drive continuous improvement in system reliability
- Excellent communication and collaboration skills, with the ability to work across cross-functional teams
- Expertise in systems administration, networking, and security in a cloud-native environment
- Deep understanding of monitoring, observability, and logging tools (Prometheus, Grafana, ELK, Datadog, etc.)
- Proficiency in scripting languages (e.g., Python, Go, Bash) and configuration management tools (e.g., Ansible, Chef, Puppet)
- Experience designing and implementing scalable and reliable microservices architectures
- Strong knowledge of CI/CD pipelines and related tools (CircleCI, Jenkins, GitLab, etc.)
Responsibilities
- Incident Management: Lead efforts in incident response, root cause analysis, and postmortem processes, while developing strategies to minimize incidents and reduce recovery times
- Capacity Planning: Analyze system performance and growth trends, and create capacity plans to ensure systems scale appropriately as demand increases
- Monitoring & Observability: Design and maintain comprehensive monitoring, logging, and alerting solutions to ensure quick detection and resolution of system anomalies
- Collaboration with Engineering Teams: Partner with software engineers, product teams, and DevOps to design systems that are both reliable and performant
- Cost Optimization: Identify opportunities to optimize infrastructure costs while maintaining system reliability and performance
- Automation & Tools Development: Build and improve automation tools, monitoring systems, and deployment pipelines to streamline operations and increase efficiency
- Mentorship & Leadership: Mentor junior and mid-level engineers, providing technical leadership and guidance on SRE best practices, incident management, and system design
- On-Call Rotation: Participate in an on-call rotation to respond to system outages and provide support for mission-critical systems
Preferred Qualifications
- Experience with large-scale distributed systems and databases (e.g., Kafka, PostgreSQL, Cassandra, MySQL)
- Experience with service mesh (e.g., Istio, Linkerd) and serverless architectures
- Strong understanding of compliance and security frameworks
- Familiarity with chaos engineering practices and tools (e.g., Gremlin, Chaos Monkey)
Benefits
- Medical/dental/vision
- Family planning/fertility
- PTO (summer and winter breaks)
- Financial wellness resources
- 401(k) plans