Senior Site Reliability Engineer at MagicSchool AI

Summary

Join MagicSchool, a leading generative AI platform for teachers, as a Senior Site Reliability Engineer (Observability & Resilience). In this hands-on role you will lead observability across the platform and design resilient infrastructure, driving instrumentation and telemetry strategy in partnership with product and engineering teams. You will define and maintain SLIs and SLOs, establish best practices for alert tuning, architect infrastructure that prioritizes high availability, and work with engineers to embed resilient design and observability, including training and support for product engineers. The role requires at least 5 years of experience in SRE, DevOps, or observability, expertise in observability tools (Grafana, Prometheus, etc.), proficiency with Terraform, and strong communication skills.

Requirements

At least 5 years in an SRE, DevOps, or observability-focused role, with a track record of success in fast-paced, high-growth environments
Experience designing and operating systems for high availability and disaster recovery
Familiarity with incident response, alert fatigue reduction, and signal-to-noise balancing
Deep experience with observability tools such as Grafana, Prometheus, Loki, Datadog, and OpenTelemetry, with a proven ability to operationalize these tools for maximum team impact
Strong proficiency with Terraform and infrastructure-as-code workflows
Experience with multi-cloud deployments and operating resilient systems at scale
Passion for enabling product engineers through training and pairing on observability patterns
Ability to drive cross-functional initiatives that improve system health and team effectiveness
Skilled at explaining complex infrastructure and observability concepts to both technical and non-technical audiences
Calm and decisive under pressure, especially during incident response

Responsibilities

Design and implement observability patterns—including metrics, logging, tracing, and alerting—to ensure we have clear, actionable visibility into platform behavior and performance
Build internal tooling and dashboards that empower our teams with real-time system insights
Define and maintain SLIs and SLOs in partnership with product and engineering teams
Establish best practices for alert tuning and signal-to-noise balancing to reduce incident fatigue and improve response accuracy
Architect and support infrastructure that prioritizes high availability, disaster recovery, and graceful degradation
Leverage Terraform and infrastructure-as-code to ensure consistent, reliable deployments across AWS and Google Cloud
Collaborate with engineers across teams to embed resilient design and observability from the ground up
Provide training and pairing support to product engineers, helping them build and maintain telemetry that supports the full software lifecycle

Preferred Qualifications

Experience with Sentinel, Loki, or similar logging/metrics stacks
Exposure to educational or compliance-heavy environments
Strong debugging skills and a calm presence during incidents

Benefits

Flexibility of working from home, while fostering a unique culture built on relationships, trust, communication, and collaboration with our team, no matter where they live
Unlimited time off to empower our employees to manage their work-life balance
Choice of employer-paid health insurance plans so that you can take care of yourself and your family
Dental and vision are also offered at very low premiums
Every employee is offered generous stock options, vested over 4 years
401(k) match and monthly wellness stipend


