Site Reliability Engineer


Input Output

πŸ“Remote - United Kingdom

Summary

Join IOHK, a blockchain technology company, as a Site Reliability Engineer (SRE) and play a crucial role in ensuring the reliability and performance of our open-source production systems. You will design, develop, and maintain tools and software using Python, Bash, Terraform, or Nix to improve service availability and scalability. This role involves collaborating with development teams, analyzing system performance, and participating in on-call rotations. You will need proficiency in Python, Bash, Terraform, and Nix, along with extensive AWS experience and knowledge of Kubernetes and PostgreSQL. Excellent communication and troubleshooting skills are essential. IOHK offers remote work, laptop reimbursement, a new starter package, learning and development opportunities, and competitive PTO.

Requirements

  • Proficiency in Python, Bash, Terraform, and Nix for DevOps services
  • Extensive experience with AWS, specifically with services like EKS and RDS
  • Familiarity with container orchestration (e.g., Kubernetes) is essential
  • Hands-on experience with PostgreSQL and its deployment on RDS
  • Knowledge of monitoring tools (e.g., Prometheus, Grafana, Loki)
  • Solid troubleshooting and performance tuning capabilities
  • Exceptional communication skills and team collaboration ethic
  • Experience with CI/CD (e.g., GitHub Actions, Hydra, Earthly)
  • Strong analytical and troubleshooting skills
  • Excellent communication skills to collaborate with development teams, operations, and other stakeholders
  • Ability to quickly learn new technologies and adapt to changing environments
  • High attention to detail to ensure system reliability and performance

Responsibilities

  • Design, write, and deliver tools and software primarily using Python, Bash, Terraform or Nix to improve the availability, scalability, and efficiency of our services
  • Engage in and refine the whole lifecycle of services, from inception and design, through deployment, operation, and continuous improvement
  • Practice sustainable incident response and promote blameless postmortems
  • Collaborate with the development teams to ensure that solutions are designed with customer experience, scalability, and performance in mind
  • Analyze system performance and reliability, offering recommendations for enhancement
  • Develop and uphold service-level objectives (SLOs), service-level indicators (SLIs), and error budgets for our services
  • Participate in on-call rotations, responding to and mitigating service interruptions and technical challenges

Benefits

  • Remote work
  • Laptop reimbursement
  • New starter package to buy hardware essentials (headphones, monitor, etc.)
  • Learning & Development opportunities
  • Competitive PTO

