Summary

Join Braze as a Senior Site Reliability Engineer and ensure the smooth operation of internal services and platforms. Collaborate with engineering teams to architect products for scalability and reliability, debug issues across all stack layers, and build infrastructure using tools like Chef, Terraform, and Kubernetes. You will also create deployment pipelines and provide centralized tooling. You will manage incidents by taking part in PagerDuty rotations and turning retrospectives into system improvements. This role requires 5+ years of experience as a Software, DevOps, or Site Reliability Engineer, including expertise in Kafka and infrastructure automation. Braze offers competitive compensation, equity, retirement plans, flexible PTO, comprehensive benefits, and professional development opportunities.

Requirements

5+ years of experience as a Software, DevOps, or Site Reliability Engineer
3+ years of data streaming reliability engineering experience monitoring, troubleshooting, and optimizing Kafka streaming applications, including diagnosing consumer lag, partition imbalances, consumer group issues, and broker failures
Expertise in setting up alerting, dashboards, and runbooks for high-availability and fault-tolerant streaming pipelines
3+ years of Kafka performance tuning and automation, with a strong background in scaling Kafka clusters, tuning producer/consumer configurations, and managing schema evolution
Proficiency in infrastructure automation (Terraform, Ansible, Kubernetes) and CI/CD practices to streamline deployments and ensure resilient data streaming workflows
You think about systems: interfaces, boundaries, edge cases, failure modes, behaviors, and specific implementations
Have an urge to collaborate, document, and deliver quickly:
Collaborating across global remote teams, often asynchronously
Documenting everything so you don't need to learn the same thing (or plan the same work) twice
Delivering fast to delight our customers, even internal ones
Have an enthusiastic, go-for-it attitude. When you see something broken, you can't help but fix it
Have a desire to solve everyday challenges facing software engineers and automate their toil away
Have an excellent ability to manage multiple tasks and expectations at once
Know your way around Linux and the Unix shell

Responsibilities

Partner with Braze’s engineering teams on:
Architecting products to use infrastructure platforms effectively, in a scalable and reliable manner
Debugging reliability and scalability issues across all stack layers, including the products built on our infrastructure platforms
Making monitoring and alerting fire on symptoms rather than on outages
Ensuring that Braze meets our strict enterprise-grade SLAs with customers
Develop Braze’s internal platform infrastructure:
Create infrastructure as code using Chef, Terraform, and Kubernetes
Develop deployment pipelines for applications in multiple languages using Docker, Kubernetes, and similar tools
Provide centralized/common tooling, services, and automation frameworks that are critical for scaling operations, managing capacity, reducing operational pain, and improving the day-to-day workflow of Braze’s engineering teams
Manage incidents:
Be on a PagerDuty rotation to respond to availability incidents and support other engineers
Use your on-call shifts to prevent incidents before they happen
Run retrospectives on everything that happens, turning lessons into system improvements, automation, and other changes

Preferred Qualifications

Have strong programming skills (Ruby and/or Go preferred)
Have experience with Docker, Kubernetes, Terraform, or similar IaC technologies
Have experience with MongoDB, Redis, Kafka, Postgres, or similar data technologies

Benefits

Competitive compensation that may include equity
Retirement and Employee Stock Purchase Plans
Flexible paid time off
Comprehensive benefit plans covering medical, dental, vision, life, and disability
Family services that include fertility benefits and equal paid parental leave
Professional development supported by formal career pathing, learning platforms, and a yearly learning stipend

Senior Site Reliability Engineer II

Braze

Remote | DevOps | Senior