Staff Site Reliability Engineer

Aerospike
Summary
Join Aerospike as a Staff Site Reliability Engineer and become a technical leader in our global SRE organization. You will drive reliability, performance, and scalability across hybrid and multi-cloud environments, mentoring others and championing modern SRE practices. You will play a key role in shaping infrastructure initiatives, from Kubernetes platforms to existing AWS and GCP services. Your impact will span teams as you solve complex problems, influence architecture, and foster a culture of ownership and continuous improvement. This role requires technical leadership, collaboration with various teams, and participation in on-call rotations. You will define and enforce reliability standards, lead incident response, and drive automation.
Requirements
- 8+ years of experience in SRE, DevOps, or infrastructure engineering, including significant time operating production systems at scale
 - Deep hands-on experience with at least one major public cloud (AWS, GCP, Azure), and working knowledge of the others; Azure experience is a plus
 - Production experience with Kubernetes, including operating clusters, Helm, operators, and supporting microservices in real-world environments
 - Strong proficiency in infrastructure-as-code tools such as Terraform and CI/CD automation platforms
 - Expertise in observability tools and practices (Datadog, Prometheus, Grafana, ELK, etc.) and in using them to define SLIs and SLOs; Datadog experience is a plus
 - Programming and scripting ability in one or more languages (Python, Go, Bash, etc.)
 - Experience with large-scale incident response and post-incident review practices
 - Proven ability to mentor other engineers and influence technical strategy across multiple teams
 - Strong communication skills to articulate complex concepts to technical and non-technical stakeholders
 
Responsibilities
- Provide technical leadership across multiple systems and environments, proactively identifying risks, shaping architecture decisions, and improving reliability and performance at scale
 - Lead key infrastructure efforts, including Kubernetes platform expansion (AKS, AKO) and the application of SRE principles to legacy systems and new cloud offerings
 - Define, measure, and enforce reliability standards through SLIs/SLOs, observability tooling, and incident response frameworks
 - Mentor and guide other SREs by leading design sessions, conducting technical deep dives, and reviewing code, configurations, and infrastructure decisions
 - Partner with product, engineering, and cloud teams to align reliability goals with delivery objectives
 - Lead root cause analyses and implement systemic fixes for issues spanning multiple platforms or services
 - Drive automation-first approaches using IaC, CI/CD pipelines, and scripting to reduce toil and increase deployment confidence
 - Influence cross-functional roadmaps, identifying areas for innovation, technical debt reduction, and long-term scalability
 - Participate in the global on-call rotation, bringing senior-level calm and clarity during incidents and escalations
 
Preferred Qualifications
- Hands-on experience managing and optimizing database deployments and services in production environments, ensuring high availability and performance
 - Familiarity with Aerospike or other distributed databases is a plus
 - Kubernetes or cloud certifications (CKA, CKS, AWS/GCP DevOps/Architect) are a plus but not required
 - Track record of influencing architectural decisions across teams or domains
 


