Senior Staff Site Reliability Engineer

SandboxAQ

💵 $217k-$304k
📍 Remote - United States

Summary

Join SandboxAQ as a Senior Staff Site Reliability Engineer and contribute to maintaining and improving the reliability, performance, and scalability of our infrastructure and services. You will collaborate with engineering teams, lead incident response, and develop strategies to minimize incidents. Your expertise will guide the development of reliable software and shape the reliability culture. This role involves capacity planning, monitoring, cost optimization, automation, mentorship, and on-call responsibilities. The position requires extensive experience in SRE, DevOps, cloud platforms, and related technologies. SandboxAQ offers competitive salaries, stock options, generous learning opportunities, comprehensive benefits, and a commitment to employee growth.

Requirements

  • 10+ years of experience in Site Reliability Engineering, DevOps, or similar roles
  • Strong experience with cloud platforms (AWS, GCP, or Azure), containerization (Docker, Kubernetes), and infrastructure-as-code (Terraform, CloudFormation)
  • Proven ability to lead post-incident reviews and drive continuous improvement in system reliability
  • Excellent communication and collaboration skills, with the ability to work effectively with cross-functional teams
  • Expertise in systems administration, networking, and security in a cloud-native environment
  • Deep understanding of monitoring, observability, and logging tools (Prometheus, Grafana, ELK, Datadog, etc.)
  • Proficiency in scripting languages (e.g., Python, Go, Bash) and configuration management tools (e.g., Ansible, Chef, Puppet)
  • Experience designing and implementing scalable and reliable microservices architectures
  • Strong knowledge of CI/CD pipelines and related tools (CircleCI, Jenkins, GitLab, etc.)

Responsibilities

  • Lead efforts in incident response, root cause analysis, and postmortem processes, while developing strategies to minimize incidents and reduce recovery times
  • Analyze system performance and growth trends, and create capacity plans to ensure systems scale appropriately as demand increases
  • Design and maintain comprehensive monitoring, logging, and alerting solutions to ensure quick detection and resolution of system anomalies
  • Partner with software engineers, product teams, and DevOps to design systems that are both reliable and performant
  • Identify opportunities to optimize infrastructure costs while maintaining system reliability and performance
  • Build and improve automation tools, monitoring systems, and deployment pipelines to streamline operations and increase efficiency
  • Mentor junior and mid-level engineers, providing technical leadership and guidance on SRE best practices, incident management, and system design
  • Participate in an on-call rotation to respond to system outages and provide support for mission-critical systems

Preferred Qualifications

  • Experience with large-scale distributed systems and databases (e.g., Kafka, PostgreSQL, Cassandra, MySQL)
  • Experience with service mesh (e.g., Istio, Linkerd) and serverless architectures
  • Strong understanding of compliance and security frameworks
  • Familiarity with chaos engineering practices and tools (e.g., Gremlin, Chaos Monkey)

Benefits

  • Annual discretionary bonuses
  • Equity
  • Competitive salaries
  • Stock options
  • Generous learning opportunities
  • Medical/dental/vision
  • Family planning/fertility
  • PTO (summer and winter breaks)
  • Financial wellness resources
  • 401(k) plans
