Site Reliability Engineer

Hack The Box

📍 Remote - Greece

Summary

Join Hack The Box as a Site Reliability Engineer (SRE) and play a pivotal role in migrating the company's infrastructure to AWS. Over the next six months, you will enhance scalability, implement Kubernetes clusters, and establish key performance indicators. You will collaborate with a team of six other SREs, engineers, data scientists, and security experts. The position offers a fully remote or hybrid work model, with opportunities for mentorship and professional development. Hack The Box provides a supportive and fun work environment with various benefits, including private insurance, 25 annual leave days, and a dedicated training budget.

Requirements

  • Hands-on Experience: Minimum 2 years of hands-on experience in site reliability engineering or a related field
  • Automation Skills: Proficient in scripting and automation using languages such as Go, Python or Bash
  • Cloud Expertise: In-depth knowledge of cloud platforms, particularly AWS
  • Containerization: Experience with containerization technologies (Docker) and orchestration (Kubernetes)
  • Monitoring Mastery: Strong expertise in implementing and managing monitoring and logging solutions
  • Metrics Framework: Proven experience establishing and managing SLAs, SLOs, and SLIs
  • Problem Solving: Proven ability to troubleshoot complex system issues and implement effective solutions
  • Collaborative Mindset: Excellent collaboration and communication skills, with a strong ability to work cross-functionally and mentor junior team members

Responsibilities

  • AWS Migration for Scalability: Contribute heavily to the migration from the current cloud provider to AWS, strategically positioning our infrastructure for scalable growth across regions
  • Expand Monitoring Stack: Integrate new systems into the Monitoring Stack, enhancing visibility and alerting capabilities for a globally distributed architecture
  • Architectural Design for Reliability: Contribute to the design and implementation of reliable AWS infrastructure, focusing on fault tolerance and high availability
  • Establish Metrics Framework: Implement and manage Service Level Agreements (SLAs), Service Level Objectives (SLOs), and Service Level Indicators (SLIs) to measure and improve system reliability
  • Incident Response Enhancement: Develop and enhance incident response processes, leveraging metrics to continually improve response times and effectiveness
  • Mentorship: Mentor and guide junior SREs in adapting to the AWS environment and implementing reliability best practices
  • Collaborative Planning: Work closely with cross-functional teams to plan and implement new systems effectively, ensuring alignment with reliability goals
  • Team Expansion: Play a key role in the team's expansion, contributing to the mentoring of junior members
  • Best Practices Advocacy: Champion best practices in AWS architecture and SRE methodologies, fostering a culture of reliability and continuous improvement

Benefits

  • Private insurance
  • 25 annual leave days
  • Dedicated budget for training and professional development, including participation in conferences
  • State-of-the-art equipment (MacBook, iPhone, and mobile plan)
  • Free lunch & snacks at the office
  • Full access to the Hack The Box lab offerings, so you can learn how to hack
  • Flexible/Hybrid working
