Summary
Join HashiCorp Boundary's Reliability Engineering team as a Senior Engineer and contribute to the seamless remote access experience for our customers. You will play a key role in enhancing the reliability and scalability of our Boundary Cloud platform. This position requires a deep understanding of production applications at scale, experience with Golang or similar backend applications, and a commitment to quality and collaboration. You will develop and implement best practices for high availability, disaster recovery, and fault tolerance. The role involves leading incident management, building monitoring tools, and participating in a 24/7 on-call rotation. This is a remote position.
Requirements
- 5-7 years of experience running production applications at scale: backend applications written in Golang or similar, PostgreSQL (or any RDBMS), observability, and AWS primitives
- Strive for quality through maintainable code and comprehensive testing from development to deployment
- Clear communication skills while remaining empathetic and kind
- An eagerness to learn through humility and reflection
- Experience debugging live production services
Responsibilities
- Develop a deep understanding of how customers interact with Boundary Cloud and continuously improve reliability and user experience
- Implement and advocate for best practices in high availability, disaster recovery, scalability, and fault tolerance
- Design and build internal developer tools to proactively detect, diagnose, and remediate reliability issues
- Lead and refine incident management processes to minimize downtime and directly improve customer satisfaction
- Enhance service reliability by developing monitoring and observability tooling using SLIs, SLOs, and SLAs
- Deploy, manage, and monitor large-scale Boundary Cloud deployments to ensure optimal performance
- Anticipate potential failures and take proactive steps to mitigate risks before they impact users
- Collaborate with cross-functional teams to refine tools and processes based on real-world production insights
- Participate in a 24/7 on-call rotation, supporting mission-critical production services
Preferred Qualifications
- Working knowledge of industry best practices related to information security
- Working knowledge of AWS Aurora or PostgreSQL, Nomad or other orchestration platforms, and Traefik or other load-balancing technologies
- Experience conceiving, documenting, and advocating for best practices, or a willingness to do so