Lead Site Reliability Engineer

Kontakt.io
Summary
Join Kontakt.io, a company building a platform for care operations, as Lead Site Reliability Engineer. This role focuses on the reliability, performance, and automation of the company's cloud-based, real-time platform. You will lead and scale the SRE team and maintain 99.99% uptime. Responsibilities include designing self-healing systems, defining SLAs, managing cloud infrastructure (AWS), optimizing containerized environments, and leading incident response. The ideal candidate has 10+ years of SRE or cloud infrastructure experience, software engineering expertise, and deep knowledge of cloud platforms, Kubernetes, and distributed systems. This position offers the opportunity to make a significant impact in healthcare.
Requirements
- 10+ years of experience in Site Reliability Engineering or Cloud Infrastructure
- 2+ years of experience as a software engineer
- Proven success scaling high-traffic, mission-critical platforms in SaaS, IoT, or healthcare
- Deep expertise in cloud platforms (AWS), Kubernetes, and distributed systems
- Strong background in monitoring, logging, and observability with Prometheus, OpenTelemetry, or similar tools
- Hands-on experience with incident management, postmortems, and building resilient systems
- Deep knowledge of CI/CD automation, GitOps, and infrastructure as code (Terraform, etc.)
- A mature leadership approach, with the ability to drive technical strategy while growing and mentoring a high-performance SRE team
- Strong understanding of network security, access management, and compliance frameworks (HIPAA, SOC 2)
Responsibilities
- Ensure 99.99% uptime across our cloud platform, meeting strict SLAs for healthcare customers
- Leverage your software engineering expertise to write high-quality, maintainable code that improves system reliability and operational efficiency
- Design and implement self-healing, fault-tolerant systems to prevent failures
- Define SLIs, SLOs, and SLAs, ensuring proactive performance monitoring and incident resolution
- Architect and manage scalable cloud infrastructure (AWS) for massive real-time data processing
- Optimize containerized environments (Kubernetes, Docker) to support multi-region deployments
- Lead the adoption of infrastructure as code (Terraform) to fully automate infrastructure management
- Build and refine a world-class monitoring, alerting, and logging system using Prometheus, Grafana, OpenTelemetry, and Datadog
- Lead incident response and on-call operations, reducing mean time to detection (MTTD) and mean time to resolution (MTTR)
- Conduct blameless postmortems and continuously improve system resilience
- Reduce manual intervention through automated deployment, scaling, and failover mechanisms
- Partner with Security & Compliance teams to ensure infrastructure meets HIPAA and SOC 2 standards
- Lead disaster recovery and business continuity planning to ensure critical healthcare services are always available
- Drive technical strategy and roadmap for scalability, monitoring, and reliability engineering
- Collaborate with Product, Engineering, and Infrastructure teams to align SRE initiatives with business priorities
Preferred Qualifications
- Experience with healthcare IT, including EHR data, FHIR, and HL7 interoperability
- Expertise in real-time distributed systems, event-driven architectures, or large-scale data pipelines
- Prior experience leading on-call rotations and major incident management processes
Benefits
- Own Mission-Critical Reliability – Ensure hospitals and care facilities always stay online with a 99.99% uptime healthcare platform
- Scale AI-Powered Infrastructure – Work on real-time automation and self-healing cloud systems that orchestrate care delivery
- Drive Big Impact in Healthcare – Help reduce waste, optimize resources, and improve patient care with technology that delivers 10X ROI
- Automation-First Culture – Minimize manual ops with cutting-edge automation, observability, and incident response strategies
- Join a High-Performing Team – Work with top engineers, AI experts, and healthcare innovators solving real-world challenges