Site Reliability Engineer at Hack The Box

Summary

Join Hack The Box as a Site Reliability Engineer (SRE) and play a pivotal role in migrating the company's infrastructure to AWS. Over the next six months, you will enhance scalability, implement Kubernetes clusters, and establish key performance indicators. You will collaborate with a team of six other SREs, engineers, data scientists, and security experts. The position offers a fully remote or hybrid work model, with opportunities for mentorship and professional development. Hack The Box provides a supportive and fun work environment with various benefits, including private insurance, 25 annual leave days, and a dedicated training budget.

Requirements

Hands-on Experience: Minimum 2 years of hands-on experience in site reliability engineering or a related field
Automation Skills: Proficient in scripting and automation using languages such as Go, Python, or Bash
Cloud Expertise: In-depth knowledge of cloud platforms, particularly AWS
Containerization: Experience with containerization technologies (Docker) and orchestration (Kubernetes)
Monitoring Mastery: Strong expertise in implementing and managing monitoring and logging solutions
Metrics Framework: Proven experience establishing and managing SLAs, SLOs, and SLIs
Problem Solving: Proven ability to troubleshoot complex system issues and implement effective solutions
Collaborative Mindset: Excellent collaboration and communication skills, with a strong ability to work cross-functionally and mentor junior team members

Responsibilities

AWS Migration for Scalability: Contribute heavily to, and help spearhead, the migration from the current cloud provider to AWS, strategically positioning our infrastructure for scalable growth across regions
Expand Monitoring Stack: Integrate new systems into the Monitoring Stack, enhancing visibility and alerting capabilities for a globally distributed architecture
Architectural Design for Reliability: Contribute to the design and implementation of reliable AWS infrastructure, focusing on fault tolerance and high availability
Establish Metrics Framework: Implement and manage Service Level Agreements (SLAs), Service Level Objectives (SLOs), and Service Level Indicators (SLIs) to measure and improve system reliability
Incident Response Enhancement: Develop and enhance incident response processes, leveraging metrics to continually improve response times and effectiveness
Mentorship: Mentor and guide junior SREs in adapting to the AWS environment and implementing reliability best practices
Collaborative Planning: Work closely with cross-functional teams to plan and implement new systems effectively, ensuring alignment with reliability goals
Team Expansion: Play a key role in the team's expansion, contributing to the mentoring of junior members
Best Practices Advocacy: Champion best practices in AWS architecture and SRE methodologies, fostering a culture of reliability and continuous improvement

Benefits

Private insurance
25 annual leave days
Dedicated budget for training, professional development, and participation in conferences
State-of-the-art equipment (MacBook, iPhone, and mobile plan)
Free lunch & snacks at the office
Full access to the Hack The Box lab offerings, so you can learn how to hack
Flexible/Hybrid working

Remote

DevOps

Mid-level
