Summary

Join SandboxAQ as a Senior Staff Site Reliability Engineer and contribute to maintaining and improving the reliability, performance, and scalability of our infrastructure and services. You will collaborate with engineering teams, lead incident response, and develop strategies to minimize incidents. Your expertise will guide the development of reliable software and shape the reliability culture. This role involves capacity planning, monitoring, cost optimization, automation, mentorship, and on-call responsibilities. The position requires extensive experience in SRE, DevOps, cloud platforms, and related technologies. SandboxAQ offers competitive salaries, stock options, generous learning opportunities, comprehensive benefits, and a commitment to employee growth.

Requirements

10+ years of experience in Site Reliability Engineering, DevOps, or similar roles
Strong experience with cloud platforms (AWS, GCP, or Azure), containerization (Docker, Kubernetes), and infrastructure-as-code (Terraform, CloudFormation)
Proven ability to lead post-incident reviews and drive continuous improvement in system reliability
Excellent communication and collaboration skills, with the ability to work effectively with cross-functional teams
Expertise in systems administration, networking, and security in a cloud-native environment
Deep understanding of monitoring, observability, and logging tools (Prometheus, Grafana, ELK, Datadog, etc.)
Proficiency in scripting languages (e.g., Python, Go, Bash) and configuration management tools (e.g., Ansible, Chef, Puppet)
Experience designing and implementing scalable and reliable microservices architectures
Strong knowledge of CI/CD pipelines and related tools (CircleCI, Jenkins, GitLab, etc.)

Responsibilities

Lead efforts in incident response, root cause analysis, and postmortem processes, while developing strategies to minimize incidents and reduce recovery times
Analyze system performance and growth trends, and create capacity plans to ensure systems scale appropriately as demand increases
Design and maintain comprehensive monitoring, logging, and alerting solutions to ensure quick detection and resolution of system anomalies
Partner with software engineers, product teams, and DevOps to design systems that are both reliable and performant
Identify opportunities to optimize infrastructure costs while maintaining system reliability and performance
Build and improve automation tools, monitoring systems, and deployment pipelines to streamline operations and increase efficiency
Mentor junior and mid-level engineers, providing technical leadership and guidance on SRE best practices, incident management, and system design
Participate in an on-call rotation to respond to system outages and provide support for mission-critical systems

Preferred Qualifications

Experience with large-scale distributed systems and databases (e.g., Kafka, PostgreSQL, Cassandra, MySQL)
Experience with service mesh (e.g., Istio, Linkerd) and serverless architectures
Strong understanding of compliance and security frameworks
Familiarity with chaos engineering practices and tools (e.g., Gremlin, Chaos Monkey)

Benefits

Annual discretionary bonuses
Equity
Competitive salaries
Stock options
Generous learning opportunities
Medical/dental/vision
Family planning/fertility
PTO (summer and winter breaks)
Financial wellness resources
401(k) plans

Senior Staff Site Reliability Engineer

SandboxAQ

Remote

DevOps

Senior
