Remote Site Reliability Engineer Technical Lead

closed

Nethermind

πŸ“Remote - EU

Job highlights

Summary

Join a team of builders and researchers on a mission to empower enterprises and developers worldwide to access and build on decentralized systems. We're seeking an experienced Site Reliability Engineer to lead and mentor our SRE team.

Requirements

  • 5+ years of experience in Site Reliability Engineering or DevOps
  • Expert knowledge of cloud platforms (AWS, GCP)
  • Expert knowledge of Kubernetes
  • Proven experience in designing and implementing scalable, efficient, resilient systems
  • Deep understanding of Linux/Unix systems and networking protocols
  • Strong programming skills in Python or Go
  • Strong background in monitoring, observability, and logging systems (e.g., Grafana, Prometheus, Loki)
  • Expertise in CI/CD tools (e.g., GitHub Actions, ArgoCD)
  • Excellent communication skills, both written and verbal, with the ability to explain complex technical concepts to various audiences
  • Experience in producing technical documentation, runbooks, presentations, and post-mortem reports
  • Experience in, and passion for, mentoring and upskilling team members

Responsibilities

  • Lead the implementation and refinement of SRE practices across the organization, including SLOs, error budgets, and blameless postmortems
  • Design and implement automation to eliminate toil and improve system reliability and efficiency
  • Lead initiatives and architect scalable hybrid cloud solutions for Web3 infrastructure
  • Manage error budgets and make data-driven decisions about when to prioritize reliability vs. new features
  • Drive SRE practices to ensure high availability, performance, and reliability under varying load conditions
  • Collaborate closely with the Platform Engineering team to build reliability into services from the ground up
  • Collaborate closely with Nethermind’s Infrastructure Leadership department to align SRE strategies with overall technical vision
  • Drive the adoption of observability best practices and implement comprehensive monitoring systems
  • Develop and maintain service level indicators (SLIs) and objectives (SLOs), working with product owners to define appropriate reliability targets
  • Mentor team members in SRE practices and foster a culture of continuous learning
  • Lead capacity planning efforts, using quantitative analysis to predict and address future scaling challenges
  • Contribute to long-term technical roadmaps, balancing reliability concerns with product innovation

Preferred Qualifications

  • Experience leading technical teams
  • Contributions to open-source projects or thought leadership in SRE
  • Familiarity with MLOps and big data technologies
  • Knowledge of blockchain technology and infrastructure
  • Experience with chaos engineering principles and tools
  • Familiarity with traffic management and CDN technologies
  • Systems or backend engineering background
This job is filled or no longer available