Principal Engineer, Site Reliability Engineering, Observability at SentinelOne

Summary

Join our Site Reliability Engineering (SRE) Team at SentinelOne as an experienced Principal Engineer to architect and lead the implementation of advanced observability, automated triage, and self-healing capabilities within our microservices-based SaaS environment.

Requirements

Extensive SRE Experience: Proven experience in architecting and implementing SRE solutions at scale within a microservices or distributed systems environment
15+ years of progressive professional experience, with 5+ years of recent experience supporting enterprise SaaS environments (or equivalent combination of education, experience, and certifications)
Technical Expertise: Deep knowledge of incident management, alert correlation, automated triage, self-healing strategies, and SLO frameworks. Strong understanding of observability platforms, including monitoring, logging, and tracing solutions
Programming & Scripting: Proficient in one or more programming languages (e.g., Python, Go, Java) with experience in automation and scripting for incident management workflows
Machine Learning & Data Analysis: Experience with machine learning, anomaly detection, or data analytics techniques for real-time alert correlation and triage systems
Cloud Infrastructure: Expertise in cloud platforms (e.g., AWS, GCP, Azure) and container orchestration (e.g., Kubernetes), with experience in infrastructure-as-code (e.g., Terraform)
Problem-Solving & Decision-Making: Ability to make critical architectural decisions with a focus on business impact, reliability, and system performance

Responsibilities

Design and guide the implementation of end-to-end alert correlation, auto-triage, and auto-remediation frameworks that meet the needs of a microservices-based SaaS architecture
Ensure solutions align with business priorities and customer impact goals
Define, implement, and monitor Service Level Objectives (SLOs) in collaboration with product and engineering teams
Establish reliability standards that meet business and customer expectations, driving accountability and transparency around service performance
Partner with software engineers, SREs, and data scientists to implement and refine monitoring, alerting, alert correlation, auto-remediation, and SLO solutions
Lead initiatives to promote best practices and knowledge sharing across all of SentinelOne engineering
Mentor engineers and contribute to a culture of reliability engineering excellence through thought leadership and guidance on advanced SRE principles and practices

Benefits

Medical, Vision, Dental
401(k)
Commuter
Health and Dependent FSA
Unlimited PTO
Industry-leading gender-neutral parental leave
Paid Company Holidays
Paid Sick Time
Employee stock purchase program
Disability and life insurance
Employee assistance program
Gym membership reimbursement
Cell phone reimbursement

Remote

DevOps

Principal
