Staff Site Reliability Engineer

Addepar
Summary
Join Addepar's Production Engineering and SRE team as a highly experienced and impactful colleague to drive the transformation of its platform towards high-level declarative infrastructure orchestration and operations. This role involves evolving the platform to integrate compute, network, and storage control planes, enabling efficient and fast-to-iterate services tailored to various product areas. The ideal candidate will lead in implementing, maintaining, and strategically evolving Addepar's Production Infrastructure, bringing innovative solutions and extensive hands-on development experience in AWS/cloud, Linux/Unix, networking, scripting, containerization, Kubernetes, Terraform, information security, debugging, and monitoring/observability. This position requires designing, deploying, monitoring, automating, and optimizing all operational aspects of Addepar's platform, focusing on reliability, scalability, and efficiency. The role demands collaboration with cross-functional teams and serving as a primary on-call responder for critical incidents.
Requirements
- Extensive progressive experience in the SRE/DevOps/Systems Engineer field, with a track record of taking on increasing responsibility
- Expert-level understanding of Cloud Infrastructure fundamentals (AWS preferred), including advanced networking, security, and managed services
- Exceptional Programming/Scripting skills in various common languages (Python, Bash, and general Linux tools are essential; Java is a strong plus), with an emphasis on building scalable, maintainable automation and tools
- Broad and deep expertise with UNIX/BSD/Linux internals (Ubuntu preferred), including performance tuning, kernel-level debugging, and advanced system administration
- Extensive Containerization experience with k8s (KOPS, EKS, ECS preferred), including cluster management, custom resource definitions (CRDs), and advanced deployment strategies
- Demonstrable experience leading initiatives with infrastructure-as-code tools such as Terraform in complex, multi-account environments
- Proficiency with monitoring, logging, and alerting tools such as Prometheus, Grafana, Sentry, Sumo Logic, or advanced AWS cloud-native tools, with a focus on observability strategy
- Excellent interpersonal and communication skills to effectively collaborate with multi-functional teams, articulate complex technical concepts, and influence outcomes
Responsibilities
- Lead the design, implementation, and operationalization of container infrastructure using Kubernetes (k8s), ensuring high availability, performance, and security
- Architect, build, and maintain advanced, automated CI/CD pipelines using Jenkins, ArgoCD, AWS CodeBuild/Pipeline, GitHub Actions, or similar, establishing best practices for deployment strategies (e.g., blue/green, canary)
- Drive the adoption and evangelism of Infrastructure as Code (IaC) principles using Terraform, scaling the Addepar Platform across regions with a focus on cost optimization and operational efficiency
- Develop deep application-level knowledge to proactively inform and influence infrastructure requirements and constraints for Developers, QA, and Management, including implementing sophisticated dashboards for Cost and Inventory management, performance analysis, and capacity planning
- Perform advanced monitoring and troubleshooting of our infrastructure and application stack using a wide array of logging/monitoring tools, driving root cause analysis and implementing preventative measures
- Initiate and lead collaborations with cross-functional teams to identify and resolve complex application or infrastructure issues, serving as a technical subject matter expert
- Serve as a primary on-call responder for critical incidents, demonstrating strong problem-solving skills under pressure and contributing to post-incident reviews to improve system resilience
Preferred Qualifications
- Demonstrable experience writing and contributing to significant systems automation tooling or open-source projects is a strong plus
- Exposure to industry practices in financial services is a plus