Staff Site Reliability Engineer at Aerospike

Summary

Join Aerospike as a Staff Site Reliability Engineer and become a technical leader in our global SRE organization. You will drive reliability, performance, and scalability across hybrid and multi-cloud environments, mentoring others and championing modern SRE practices. You will play a key role in shaping infrastructure initiatives, from Kubernetes platforms to existing AWS and GCP services. Your impact will span teams as you solve complex problems, influence architecture, and foster a culture of ownership and continuous improvement. This role requires technical leadership, collaboration with various teams, and participation in on-call rotations. You will define and enforce reliability standards, lead incident response, and drive automation.

Requirements

8+ years of experience in SRE, DevOps, or infrastructure engineering, including significant time operating production systems at scale
Deep hands-on experience with at least one major public cloud (AWS, GCP, Azure), and working knowledge of the others; Azure experience is a plus
Production experience with Kubernetes, including operating clusters, Helm, operators, and supporting microservices in real-world environments
Strong proficiency in infrastructure-as-code tools such as Terraform and CI/CD automation platforms
Expertise in observability tools and practices (Datadog, Prometheus, Grafana, ELK, etc.) and in using them to define SLIs and SLOs; Datadog experience is a plus
Programming and scripting ability in one or more languages (Python, Go, Bash, etc.)
Experience with large-scale incident response and post-incident review practices
Proven ability to mentor other engineers and influence technical strategy across multiple teams
Strong communication skills to articulate complex concepts to technical and non-technical stakeholders

Responsibilities

Provide technical leadership across multiple systems and environments, proactively identifying risks, shaping architecture decisions, and improving reliability and performance at scale
Lead key infrastructure efforts, including Kubernetes platform expansion (AKS, AKO) and the application of SRE principles to legacy systems and new cloud offerings
Define, measure, and enforce reliability standards through SLIs/SLOs, observability tooling, and incident response frameworks
Mentor and guide other SREs by leading design sessions, conducting technical deep dives, and reviewing code, configurations, and infrastructure decisions
Partner with product, engineering, and cloud teams to align reliability goals with delivery objectives
Lead root cause analyses and implement systemic fixes for issues spanning multiple platforms or services
Drive automation-first approaches using IaC, CI/CD pipelines, and scripting to reduce toil and increase deployment confidence
Influence cross-functional roadmaps, identifying areas for innovation, technical debt reduction, and long-term scalability
Participate in the global on-call rotation, bringing senior-level calm and clarity during incidents and escalations

Preferred Qualifications

Hands-on experience managing and optimizing database deployments and services in production environments, ensuring high availability and performance
Familiarity with Aerospike or other distributed databases is a plus
Kubernetes or cloud certifications (CKA, CKS, AWS/GCP DevOps/Architect) are a plus but not required
Track record of influencing architectural decisions across teams or domains

Remote

DevOps

Mid-level
