Senior Staff Site Reliability Engineer - Observability


SentinelOne

💵 $198k-$270k
📍 Remote - United States

Summary

Join SentinelOne's Site Reliability Engineering (SRE) team as a Senior Staff Engineer! This role, offering 100% remote or hybrid options for US-based individuals, focuses on architecting and implementing advanced observability, automated triage, and self-healing capabilities within our microservices-based SaaS environment. You will drive our organization's evolution towards proactive, scalable incident management, and you will define and implement Service Level Objectives (SLOs) aligned with business goals. This position requires extensive SRE experience, strong technical expertise, and proficiency in programming and scripting. SentinelOne offers a competitive salary, comprehensive benefits, and a collaborative work environment.

Requirements

  • Extensive SRE Experience: Proven experience in architecting and implementing SRE solutions at scale within a microservices or distributed systems environment
  • 10+ years of progressive professional experience, with 5+ years of recent experience supporting enterprise SaaS environments (or equivalent combination of education, experience, and certifications)
  • Technical Expertise: Deep knowledge of incident management, alert correlation, automated triage, self-healing strategies, and SLO frameworks. Strong understanding of observability platforms, including monitoring, logging, and tracing solutions
  • Programming & Scripting: Proficient in one or more programming languages (e.g., Python, Go, Java) with experience in automation and scripting for incident management workflows
  • Machine Learning & Data Analysis: Experience with machine learning, anomaly detection, or data analytics techniques for real-time alert correlation and triage systems
  • Cloud Infrastructure: Expertise in cloud platforms (e.g., AWS, GCP, Azure) and container orchestration (e.g., Kubernetes), with experience in infrastructure-as-code (e.g., Terraform)
  • Problem-Solving & Decision-Making: Ability to make critical architectural decisions with a focus on business impact, reliability, and system performance
  • U.S. Citizenship is required for this position

Responsibilities

  • Design and guide the implementation of end-to-end alert correlation, auto-triage, and auto-remediation frameworks that meet the needs of a microservices-based SaaS architecture
  • Ensure solutions align with business priorities and customer impact goals
  • Define, implement, and monitor SLOs in collaboration with product and engineering teams
  • Establish reliability standards that meet business and customer expectations, driving accountability and transparency around service performance
  • Partner with software engineers, SREs, and data scientists to implement and refine monitoring, alerting, alert correlation, auto-remediation, and SLO solutions
  • Lead initiatives to promote best practices and knowledge sharing across all of SentinelOne engineering
  • Mentor engineers and contribute to a culture of reliability engineering excellence through thought leadership and guidance on advanced SRE principles and practices

Benefits

  • Medical, Vision, Dental
  • 401(k)
  • Commuter
  • Health and Dependent FSA
  • Unlimited PTO
  • Industry-leading gender-neutral parental leave
  • Paid Company Holidays
  • Paid Sick Time
  • Employee stock purchase program
  • Disability and life insurance
  • Employee assistance program
  • Gym membership reimbursement
  • Cell phone reimbursement
This job is filled or no longer available