Staff Site Reliability Engineer

Aerospike
Summary
Join Aerospike as a Staff Site Reliability Engineer and become a technical leader in our global SRE organization. You will drive reliability, performance, and scalability across hybrid and multi-cloud environments, mentoring others and championing modern SRE practices. You will play a key role in shaping infrastructure initiatives, from Kubernetes platforms to existing AWS and GCP services. Your impact will span teams as you solve complex problems, influence architecture, and foster a culture of ownership and continuous improvement. This role requires technical leadership, collaboration with various teams, and participation in on-call rotations. You will define and enforce reliability standards, lead incident response, and drive automation.
Requirements
- 8+ years of experience in SRE, DevOps, or infrastructure engineering, including significant time operating production systems at scale
- Deep hands-on experience with at least one major public cloud (AWS, GCP, Azure), and working knowledge of the others; Azure experience is a plus
- Production experience with Kubernetes, including operating clusters, Helm, operators, and supporting microservices in real-world environments
- Strong proficiency in infrastructure-as-code tools such as Terraform and CI/CD automation platforms
- Expertise in observability tools and practices (Datadog, Prometheus, Grafana, ELK, etc.) and using them to define SLIs and SLOs; Datadog experience is a plus
- Programming and scripting ability in one or more languages (Python, Go, Bash, etc.)
- Experience with large-scale incident response and post-incident review practices
- Proven ability to mentor other engineers and influence technical strategy across multiple teams
- Strong communication skills to articulate complex concepts to technical and non-technical stakeholders
Responsibilities
- Provide technical leadership across multiple systems and environments, proactively identifying risks, shaping architecture decisions, and improving reliability and performance at scale
- Lead key infrastructure efforts, including Kubernetes platform expansion (AKS, AKO) and the application of SRE principles to legacy systems and new cloud offerings
- Define, measure, and enforce reliability standards through SLIs/SLOs, observability tooling, and incident response frameworks
- Mentor and guide other SREs by leading design sessions, conducting technical deep dives, and reviewing code, configurations, and infrastructure decisions
- Partner with product, engineering, and cloud teams to align reliability goals with delivery objectives
- Lead root cause analyses and implement systemic fixes for issues spanning multiple platforms or services
- Drive automation-first approaches using IaC, CI/CD pipelines, and scripting to reduce toil and increase deployment confidence
- Influence cross-functional roadmaps, identifying areas for innovation, technical debt reduction, and long-term scalability
- Participate in the global on-call rotation, bringing senior-level calm and clarity during incidents and escalations
Preferred Qualifications
- Hands-on experience managing and optimizing database deployments and services in production environments, ensuring high availability and performance
- Familiarity with Aerospike or other distributed databases is a plus
- Kubernetes or cloud certifications (CKA, CKS, AWS/GCP DevOps/Architect) are a plus but not required
- Track record of influencing architectural decisions across teams or domains