Senior SRE at Heidi Health

Summary

Join Heidi, a health tech startup on a mission to revolutionise healthcare delivery, as a Senior Site Reliability Engineer. You will play a crucial role in establishing and scaling our reliability practices, ensuring that our AI-powered healthcare systems are robust and secure. This involves designing and implementing comprehensive observability strategies, managing incidents, defining SLAs/SLOs, and optimising costs. You will collaborate closely with our engineering team and contribute to a blameless culture. The ideal candidate has extensive experience with observability platforms, incident management, and SRE practices. Heidi offers a flexible hybrid work environment, additional paid time off, corporate fitness rates, a personal development budget, equity, and the opportunity to make a global impact.

Requirements

Extensive experience with observability platforms (Datadog preferred) and understanding of observability architecture
Strong knowledge of OpenTelemetry and modern instrumentation practices
Experience implementing APM and RUM in Python and React/React Native environments
Track record of establishing incident management processes and fostering a blameless culture
Experience defining and implementing SLAs/SLOs for enterprise customers
Strong background in monitoring distributed systems and third-party service integrations
Experience with cloud infrastructure (AWS required, Azure and GCP beneficial)
Proven track record in implementing SRE practices and reliability improvements

Responsibilities

Design and implement comprehensive observability strategies using Datadog, or other tooling if you can convince us it is a better fit
Implement OpenTelemetry instrumentation across our backend and frontend services
Set up real user monitoring (RUM) and application performance monitoring (APM) to ensure end-to-end visibility
Create and maintain dashboards that provide meaningful insights for different stakeholders (technical teams, support, management)
Monitor and optimise third-party service integrations, particularly for critical services
Establish and implement incident management processes from the ground up
Evaluate and implement appropriate incident management tools that integrate with our observability stack
Create and maintain incident response playbooks and automated runbooks
Lead post-incident reviews and foster a blameless culture
Implement and maintain on-call rotations and escalation policies
Define and implement SLOs that align with business requirements and customer expectations
Set up error budgets and tracking mechanisms
Create comprehensive SLA reporting for enterprise customers
Design and implement SLI metrics that provide meaningful insights into service health
Optimise observability costs through efficient logging and metrics collection
Implement log management and retention strategies
Fine-tune alerting to minimise alert fatigue while maintaining service reliability
Evaluate and recommend cost-effective tooling solutions

Preferred Qualifications

Experience with chaos engineering practices
Knowledge of automated runbook implementation
Healthcare industry experience
Understanding of HIPAA or similar healthcare compliance frameworks

Benefits

Flexible work with a 50% hybrid environment
Additional paid days off for your birthday and for wellness days
Special corporate rates at Anytime Fitness in Melbourne (Sydney to be confirmed)
A generous personal development budget of $500 per annum
Become an owner with shares (equity) in the company: if Heidi wins, we all win

Remote

DevOps

Senior
