Lead Site Reliability Engineer at Kontakt.io

Summary

Join Kontakt.io, a company building a platform for care operations, as Lead Site Reliability Engineer. The role focuses on the reliability, performance, and automation of the company's cloud-based, real-time platform. You will lead and scale the SRE team and keep the platform at 99.99% uptime. Responsibilities include designing self-healing systems, defining SLAs, managing cloud infrastructure (AWS), optimizing containerized environments, and leading incident response. The ideal candidate has 10+ years of SRE or cloud infrastructure experience, software engineering expertise, and deep knowledge of cloud platforms, Kubernetes, and distributed systems. The position offers the opportunity to make a significant impact in healthcare.

Requirements

10+ years of experience in Site Reliability Engineering or Cloud Infrastructure
2+ years of experience as a software engineer
Proven success scaling high-traffic, mission-critical platforms in SaaS, IoT, or healthcare
Deep expertise in cloud platforms (AWS), Kubernetes, and distributed systems
Strong background in monitoring, logging, and observability with Prometheus, OpenTelemetry, or similar tools
Hands-on experience with incident management, postmortems, and building resilient systems
Deep knowledge of CI/CD automation, GitOps, and infrastructure as code (Terraform, etc.)
A mature leadership approach, with the ability to drive technical strategy while growing and mentoring a high-performance SRE team
Strong understanding of network security, access management, and compliance frameworks (HIPAA, SOC 2)

Responsibilities

Ensure 99.99% uptime across our cloud platform, meeting strict SLAs for healthcare customers
Leverage your software engineering expertise to write high-quality, maintainable code that improves system reliability and operational efficiency
Design and implement self-healing, fault-tolerant systems to prevent failures before they happen
Define SLIs, SLOs, and SLAs, ensuring proactive performance monitoring and incident resolution
Architect and manage scalable cloud infrastructure (AWS) for massive real-time data processing
Optimize containerized environments (Kubernetes, Docker) to support multi-region deployments
Lead the adoption of infrastructure as code (Terraform) to fully automate infrastructure management
Build and refine a world-class monitoring, alerting, and logging system using Prometheus, Grafana, OpenTelemetry, and Datadog
Lead incident response and on-call operations, reducing mean time to detection (MTTD) and mean time to resolution (MTTR)
Conduct blameless postmortems and continuously improve system resilience
Reduce manual intervention through automated deployment, scaling, and failover mechanisms
Partner with Security & Compliance teams to ensure infrastructure meets HIPAA and SOC 2 standards
Lead disaster recovery and business continuity planning to ensure critical healthcare services are always available
Drive technical strategy and roadmap for scalability, monitoring, and reliability engineering
Collaborate with Product, Engineering, and Infrastructure teams to align SRE initiatives with business priorities

Preferred Qualifications

Experience with healthcare IT, including EHR data, FHIR, and HL7 interoperability
Expertise in real-time distributed systems, event-driven architectures, or large-scale data pipelines
Prior experience leading on-call rotations and major incident management processes

Benefits

Own Mission-Critical Reliability – Ensure hospitals and care facilities always stay online with a 99.99% uptime healthcare platform
Scale AI-Powered Infrastructure – Work on real-time automation and self-healing cloud systems that orchestrate care delivery
Drive Big Impact in Healthcare – Help reduce waste, optimize resources, and improve patient care with technology that delivers 10X ROI
Automation-First Culture – Minimize manual ops with cutting-edge automation, observability, and incident response strategies
Join a High-Performing Team – Work with top engineers, AI experts, and healthcare innovators solving real-world challenges

Remote

DevOps

Senior
