Staff/Senior Staff Site Reliability Engineer at SandboxAQ

Summary

Join SandboxAQ as a Staff/Senior Staff Site Reliability Engineer to maintain and improve the reliability, performance, and scalability of our infrastructure and services. Collaborate with engineering teams to ensure system resilience and high availability. Lead incident response, root cause analysis, and capacity planning. Design and maintain monitoring and alerting solutions. Optimize infrastructure costs and build automation tools. Mentor junior engineers and participate in an on-call rotation. This role requires 10+ years of experience in SRE, DevOps, or similar roles, strong cloud platform experience, and expertise in systems administration, networking, and security. The salary range is $183k-$280k per year, plus potential bonuses and equity.

Requirements

10+ years of experience in Site Reliability Engineering, DevOps, or similar roles
Strong experience with cloud platforms (AWS, GCP, or Azure), containerization (Docker, Kubernetes), and infrastructure-as-code (Terraform, CloudFormation)
Proven ability to lead post-incident reviews and drive continuous improvement in system reliability
Excellent communication and collaboration skills, with the ability to work across cross-functional teams
Expertise in systems administration, networking, and security in a cloud-native environment
Deep understanding of monitoring, observability, and logging tools (Prometheus, Grafana, ELK, Datadog, etc.)
Proficiency in scripting languages (e.g., Python, Go, Bash) and configuration management tools (e.g., Ansible, Chef, Puppet)
Experience designing and implementing scalable and reliable microservices architectures
Strong knowledge of CI/CD pipelines and related tools (CircleCI, Jenkins, GitLab, etc.)

Responsibilities

Incident Management: Lead efforts in incident response, root cause analysis, and postmortem processes, while developing strategies to minimize incidents and reduce recovery times
Capacity Planning: Analyze system performance and growth trends, and create capacity plans to ensure systems scale appropriately as demand increases
Monitoring & Observability: Design and maintain comprehensive monitoring, logging, and alerting solutions to ensure quick detection and resolution of system anomalies
Collaboration with Engineering Teams: Partner with software engineers, product teams, and DevOps to design systems that are both reliable and performant
Cost Optimization: Identify opportunities to optimize infrastructure costs while maintaining system reliability and performance
Automation & Tools Development: Build and improve automation tools, monitoring systems, and deployment pipelines to streamline operations and increase efficiency
Mentorship & Leadership: Mentor junior and mid-level engineers, providing technical leadership and guidance on SRE best practices, incident management, and system design
On-Call Rotation: Participate in an on-call rotation to respond to system outages and provide support for mission-critical systems

Preferred Qualifications

Experience with large-scale distributed systems and databases (e.g., Kafka, PostgreSQL, Cassandra, MySQL)
Experience with service mesh (e.g., Istio, Linkerd) and serverless architectures
Strong understanding of compliance and security frameworks
Familiarity with chaos engineering practices and tools (e.g., Gremlin, Chaos Monkey)

Benefits

Medical/dental/vision
Family planning/fertility
PTO (summer and winter breaks)
Financial wellness resources
401(k) plans

Tags: Remote, DevOps, Senior