Summary
Join SentinelOne's Site Reliability Engineering (SRE) team as a Senior Staff Engineer! This role, offering 100% remote or hybrid options for US-based individuals, focuses on architecting and implementing advanced observability, automated triage, and self-healing capabilities within our microservices-based SaaS environment. You will drive our organization's evolution towards proactive, scalable incident management, and you will define and implement Service Level Objectives (SLOs) aligned with business goals. This position requires extensive SRE experience, strong technical expertise, and proficiency in programming and scripting. SentinelOne offers a competitive salary, comprehensive benefits, and a collaborative work environment.
Requirements
- Extensive SRE Experience: Proven experience in architecting and implementing SRE solutions at scale within a microservices or distributed systems environment
- 10+ years of progressive professional experience, with 5+ years of recent experience supporting enterprise SaaS environments (or equivalent combination of education, experience, and certifications)
- Technical Expertise: Deep knowledge of incident management, alert correlation, automated triage, self-healing strategies, and SLO frameworks. Strong understanding of observability platforms, including monitoring, logging, and tracing solutions
- Programming & Scripting: Proficient in one or more programming languages (e.g., Python, Go, Java) with experience in automation and scripting for incident management workflows
- Machine Learning & Data Analysis: Experience with machine learning, anomaly detection, or data analytics techniques for real-time alert correlation and triage systems
- Cloud Infrastructure: Expertise in cloud platforms (e.g., AWS, GCP, Azure) and container orchestration (e.g., Kubernetes), with experience in infrastructure-as-code (e.g., Terraform)
- Problem-Solving & Decision-Making: Ability to make critical architectural decisions with a focus on business impact, reliability, and system performance
- U.S. Citizenship is required for this position
Responsibilities
- Design and guide the implementation of end-to-end alert correlation, auto-triage, and auto-remediation frameworks that meet the needs of a microservices-based SaaS architecture
- Ensure solutions align with business priorities and customer impact goals
- Define, implement, and monitor SLOs in collaboration with product and engineering teams
- Establish reliability standards that meet business and customer expectations, driving accountability and transparency around service performance
- Partner with software engineers, SREs, and data scientists to implement and refine monitoring, alerting, alert correlation, auto-remediation, and SLO solutions
- Lead initiatives to promote best practices and knowledge sharing across all of SentinelOne engineering
- Mentor engineers and contribute to a culture of reliability engineering excellence through thought leadership and guidance on advanced SRE principles and practices
Benefits
- Medical, Vision, Dental
- 401(k)
- Commuter benefits
- Health and Dependent FSA
- Unlimited PTO
- Industry-leading gender-neutral parental leave
- Paid Company Holidays
- Paid Sick Time
- Employee stock purchase program
- Disability and life insurance
- Employee assistance program
- Gym membership reimbursement
- Cell phone reimbursement