Site Reliability Engineering

Input Output

πŸ“Remote - United Kingdom

Summary

Join IOG's Midnight Tribe as the Head of Site Reliability Engineering (SRE) and lead the infrastructure and reliability strategy for the Midnight Network, a blockchain platform focused on data protection. In this senior leadership role, you will own the reliability, scalability, and performance of the platform while building and leading a high-performing SRE team. You will be instrumental in setting the foundations of our infrastructure, designing globally scalable systems, and ensuring high availability within a blockchain architecture. This hands-on role demands technical depth, architectural vision, operational rigor, and strong people leadership skills. You will drive initiatives to enhance service reliability, oversee the entire service lifecycle, and collaborate with engineering and testing teams to build robust production systems and ensure sustainable incident response.

Requirements

  • Bachelor's degree in Computer Science, Information Technology, or a related field
  • At least 8 years in a Reliability Engineering, DevOps, or infrastructure-focused role
  • Proven track record of leading and managing a high-performing SRE team
  • Experience writing code in Python, Rust/C++ or JavaScript
  • Proven experience in build and release engineering, Linux operations, and automation
  • Systematic problem-solving approach, coupled with effective communication skills and a sense of drive
  • You work well both independently and as part of a team
  • You are kind and respectful of others’ opinions and you are open and act with integrity when engaging in academic or technical discussions
  • Proven experience in capacity planning, performance monitoring, and optimization to ensure systems can handle current and future loads efficiently
  • System engineering experience working with application servers, containers, and web servers
  • Demonstrated ability to analyze incidents, identify root causes, and implement preventive measures to reduce the likelihood of recurring issues
  • Strong understanding of cloud architecture, including the major cloud providers (AWS, GCP, etc.)
  • Experience working with Docker containers and related orchestration technologies (such as Kubernetes or ECS)
  • Knowledge of SRE principles (observability, SLOs, SLIs, logging, etc.)
  • Understanding of the networking and security considerations involved in designing the architecture of our deployment environments
  • Fluency in Git-based workflows and commit discipline
  • Experience in providing mentorship and coaching to team members

Responsibilities

  • Lead the SRE team, sharing expertise and best practices. Coach, mentor, and develop the SRE team
  • Demonstrate leadership in driving initiatives that enhance service reliability, scalability, and overall performance
  • Lead the entire lifecycle of services, including inception, design, deployment, operation, and refinement
  • Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning, and launch reviews
  • Oversee the maintenance of live services by continuously measuring and monitoring factors like availability, latency, and overall system health
  • Assist our teams in creating software that is both simple and flexible to configure and deploy
  • Lead sustainable incident response practices, ensuring timely resolution with a focus on minimizing impact
  • Collaborate with software engineering and testing teams to establish and maintain automated regression suite infrastructure and performance testing
  • Sustainably scale systems through mechanisms like automation; evolve systems by pushing for changes that improve reliability and velocity
  • Conduct blameless postmortems to analyze incidents, identify root causes, and implement preventive measures

Benefits

  • Remote work
  • Laptop reimbursement
  • New starter package to buy hardware essentials (headphones, monitor, etc.)
  • Learning & Development opportunities
  • Competitive PTO
