Remote Site Reliability Engineer

Braze

๐Ÿ“Remote - Canada

Job highlights

Summary

Join our team of Site Reliability Engineers at Braze, where you'll collaborate with engineering teams to improve infrastructure, automation, and tooling. As an SRE, you'll ensure site uptime, architect products for scalability, and develop internal platform infrastructure.

Requirements

  • 3+ years of experience as a Software, DevOps, or Site Reliability Engineer
  • You think about systems - interfaces, boundaries, edge cases, failure modes, behaviors, specific implementations
  • Have an urge to collaborate, document, and deliver quickly:
      • Collaborate across global remote teams, often working asynchronously
      • Document everything so you don't need to learn the same thing (or plan the same work) twice
      • Deliver fast to delight our customers, even internal ones
  • Have an enthusiastic, go-for-it attitude. When you see something broken, you can't help but fix it
  • Have a desire to solve everyday challenges facing software engineers and automate their toil away
  • Have an excellent ability to manage multiple tasks and expectations at once
  • Know your way around Linux and the Unix shell
  • Have strong programming skills - Ruby and/or Go preferred
  • Have experience with Docker, Kubernetes, Terraform, or similar IaC technologies
  • Have experience with MongoDB, Redis, Kafka, Postgres, or similar data technologies

Responsibilities

  • Partner with Braze’s engineering teams on:
      • Architecting products to effectively utilize infrastructure platforms in a scalable, reliable manner
      • Debugging reliability and scalability issues across all stack layers, including the products built using our infrastructure platforms
      • Making monitoring and alerting trigger on symptoms and not on outages
      • Ensuring that Braze meets our strict enterprise-grade SLAs with customers
  • Develop Braze’s internal platform infrastructure:
      • Create infrastructure as code using Chef, Terraform, and Kubernetes
      • Develop deployment pipelines for applications in multiple languages using Docker, Kubernetes, etc.
      • Provide centralized/common tooling, services, and automation frameworks that are critical for scaling operations, capacity management, reducing operational pain, and improving the day-to-day workflow of Braze’s engineering teams
  • Manage incidents:
      • Be on a PagerDuty rotation to respond to availability incidents and provide support for other engineers
      • Use your on-call shift to prevent incidents from ever happening
      • Retrospect everything that happens to turn lessons into system improvements/changes, automation, etc.

Benefits

  • Competitive compensation that may include equity
  • Retirement and Employee Stock Purchase Plans
  • Flexible paid time off
  • Comprehensive benefit plans covering medical, dental, vision, life, and disability
  • Family services that include fertility benefits and equal paid parental leave
  • Professional development supported by formal career pathing, learning platforms, and tuition reimbursement
  • Community engagement opportunities throughout the year, including an annual company-wide Volunteer Week
  • Employee Resource Groups that provide supportive communities within Braze
  • Collaborative, transparent, and fun culture recognized as a Great Place to Work®
This job is filled or no longer available