Senior SRE
Heidi Health
Summary
Join Heidi, a health tech startup on a mission to revolutionise healthcare delivery, as a Senior Site Reliability Engineer. You will play a crucial role in establishing and scaling our reliability practices, ensuring robust and secure AI-powered healthcare systems. This involves designing and implementing comprehensive observability strategies, managing incidents, defining SLAs/SLOs, and optimising costs. You will collaborate closely with our engineering team and contribute to a blameless culture. The ideal candidate possesses extensive experience with observability platforms, incident management, and SRE practices. Heidi offers a flexible hybrid work environment, additional paid time off, corporate fitness rates, a personal development budget, equity, and the opportunity to make a global impact.
Requirements
- Extensive experience with observability platforms (Datadog preferred) and understanding of observability architecture
- Strong knowledge of OpenTelemetry and modern instrumentation practices
- Experience implementing APM and RUM in Python and React/React Native environments
- Track record of establishing incident management processes and fostering a blameless culture
- Experience defining and implementing SLAs/SLOs for enterprise customers
- Strong background in monitoring distributed systems and third-party service integrations
- Experience with cloud infrastructure (AWS required, Azure and GCP beneficial)
- Proven track record in implementing SRE practices and reliability improvements
Responsibilities
- Design and implement comprehensive observability strategies using Datadog, or any other tooling you can convince us to adopt!
- Implement OpenTelemetry instrumentation across our backend and frontend services
- Set up real user monitoring (RUM) and application performance monitoring (APM) to ensure end-to-end visibility
- Create and maintain dashboards that provide meaningful insights for different stakeholders (technical teams, support, management)
- Monitor and optimise third-party service integrations, particularly for critical services
- Establish and implement incident management processes from the ground up
- Evaluate and implement appropriate incident management tools that integrate with our observability stack
- Create and maintain incident response playbooks and automated runbooks
- Lead post-incident reviews and foster a blameless culture
- Implement and maintain on-call rotations and escalation policies
- Define and implement SLOs that align with business requirements and customer expectations
- Set up error budgets and tracking mechanisms
- Create comprehensive SLA reporting for enterprise customers
- Design and implement SLI metrics that provide meaningful insights into service health
- Optimise observability costs through efficient logging and metrics collection
- Implement log management and retention strategies
- Fine-tune alerting to minimise alert fatigue while maintaining service reliability
- Evaluate and recommend cost-effective tooling solutions
Preferred Qualifications
- Experience with chaos engineering practices
- Knowledge of automated runbook implementation
- Healthcare industry experience
- Understanding of HIPAA or similar healthcare compliance frameworks
Benefits
- Flexible work with a 50% hybrid environment
- An additional paid day off for your birthday, plus wellness days
- Special corporate rates at Anytime Fitness in Melbourne (Sydney TBC)
- A generous personal development budget of $500 per annum
- Become an owner, with shares (equity) in the company. If Heidi wins, we all win