Senior SRE

Heidi Health

๐Ÿ“Remote - Australia

Summary

Join Heidi, a health tech startup on a mission to revolutionize healthcare delivery, as a Senior Site Reliability Engineer. You will play a crucial role in establishing and scaling our reliability practices, ensuring robust and secure AI-powered healthcare systems. This involves designing and implementing comprehensive observability strategies, managing incidents, defining SLAs/SLOs, and optimizing costs. You will collaborate closely with our engineering team and contribute to a blameless culture. The ideal candidate possesses extensive experience with observability platforms, incident management, and SRE practices. Heidi offers a flexible hybrid work environment, additional paid time off, corporate fitness rates, a personal development budget, equity, and the opportunity to make a global impact.

Requirements

  • Extensive experience with observability platforms (Datadog preferred) and understanding of observability architecture
  • Strong knowledge of OpenTelemetry and modern instrumentation practices
  • Experience implementing APM and RUM in Python and React/React Native environments
  • Track record of establishing incident management processes and fostering a blameless culture
  • Experience defining and implementing SLAs/SLOs for enterprise customers
  • Strong background in monitoring distributed systems and third-party service integrations
  • Experience with cloud infrastructure (AWS required, Azure and GCP beneficial)
  • Proven track record in implementing SRE practices and reliability improvements

Responsibilities

  • Design and implement comprehensive observability strategies using Datadog, or other tooling you can convince us to adopt
  • Implement OpenTelemetry instrumentation across our backend and frontend services
  • Set up real user monitoring (RUM) and application performance monitoring (APM) to ensure end-to-end visibility
  • Create and maintain dashboards that provide meaningful insights for different stakeholders (technical teams, support, management)
  • Monitor and optimise third-party service integrations, particularly for critical services
  • Establish and implement incident management processes from the ground up
  • Evaluate and implement appropriate incident management tools that integrate with our observability stack
  • Create and maintain incident response playbooks and automated runbooks
  • Lead post-incident reviews and foster a blameless culture
  • Implement and maintain on-call rotations and escalation policies
  • Define and implement SLOs that align with business requirements and customer expectations
  • Set up error budgets and tracking mechanisms
  • Create comprehensive SLA reporting for enterprise customers
  • Design and implement SLI metrics that provide meaningful insights into service health
  • Optimise observability costs through efficient logging and metrics collection
  • Implement log management and retention strategies
  • Fine-tune alerting to minimise alert fatigue while maintaining service reliability
  • Evaluate and recommend cost-effective tooling solutions

Preferred Qualifications

  • Experience with chaos engineering practices
  • Knowledge of automated runbook implementation
  • Healthcare industry experience
  • Understanding of HIPAA or similar healthcare compliance frameworks

Benefits

  • Flexible work with a 50% hybrid environment
  • An additional paid day off for your birthday, plus wellness days
  • Special corporate rates at Anytime Fitness in Melbourne (Sydney TBC)
  • A generous personal development budget of $500 per annum
  • Become an owner, with shares (equity) in the company: if Heidi wins, we all win
