Senior Site Reliability Engineer

MagicSchool AI

📍Remote - Worldwide

Summary

Join MagicSchool, a leading generative AI platform for teachers, as a Senior Site Reliability Engineer (Observability & Resilience). Lead observability across the platform and design resilient infrastructure. Drive instrumentation and telemetry strategy, partnering with product and engineering teams. Define and maintain SLIs and SLOs, establish best practices for alert tuning, and architect infrastructure prioritizing high availability. Collaborate with engineers to embed resilient design and observability. Provide training and support to product engineers. This hands-on role requires at least 5 years of experience in SRE, DevOps, or observability, expertise in observability tools (Grafana, Prometheus, etc.), proficiency with Terraform, and strong communication skills.

Requirements

  • At least 5 years in an SRE, DevOps, or observability-focused role, with a track record of success in fast-paced, high-growth environments
  • Experience designing and operating systems for high availability and disaster recovery
  • Familiarity with incident response, alert fatigue reduction, and signal-to-noise balancing
  • Deep experience with observability tools such as Grafana, Prometheus, Loki, Datadog, and OpenTelemetry
  • Proven ability to operationalize these tools for maximum team impact
  • Strong proficiency with Terraform and infrastructure-as-code workflows
  • Experience with multi-cloud deployments and operating resilient systems at scale
  • Passion for enabling product engineers through training and pairing on observability patterns
  • Ability to drive cross-functional initiatives that improve system health and team effectiveness
  • Skilled at explaining complex infrastructure and observability concepts to both technical and non-technical audiences
  • Calm and decisive under pressure, especially during incident response

Responsibilities

  • Design and implement observability patterns—including metrics, logging, tracing, and alerting—to ensure we have clear, actionable visibility into platform behavior and performance
  • Build internal tooling and dashboards that empower our teams with real-time system insights
  • Define and maintain SLIs and SLOs in partnership with product and engineering teams
  • Establish best practices for alert tuning and signal-to-noise balancing to reduce incident fatigue and improve response accuracy
  • Architect and support infrastructure that prioritizes high availability, disaster recovery, and graceful degradation
  • Leverage Terraform and infrastructure-as-code to ensure consistent, reliable deployments across AWS and Google Cloud
  • Collaborate with engineers across teams to embed resilient design and observability from the ground up
  • Provide training and pairing support to product engineers, helping them build and maintain telemetry that supports the full software lifecycle

Preferred Qualifications

  • Experience with Sentinel, Loki, or similar logging/metrics stacks
  • Exposure to educational or compliance-heavy environments
  • Strong debugging skills and a calm presence during incidents

Benefits

  • Flexibility to work from home, within a culture built on relationships, trust, communication, and collaboration across our team, no matter where they live
  • Unlimited time off to empower our employees to manage their work-life balance
  • Choice of employer-paid health insurance plans so that you can take care of yourself and your family
  • Dental and vision coverage offered at very low premiums
  • Generous stock options for every employee, vesting over 4 years
  • 401(k) match and monthly wellness stipend
