Remote Site Reliability Engineer Technical Lead

Nethermind

📍 Remote - EU

Job highlights

Summary

Join a team of builders and researchers on a mission to empower enterprises and developers worldwide to access and build on decentralized systems. We're seeking an experienced Site Reliability Engineer to lead and mentor our SRE team.

Requirements

  • 5+ years of experience in Site Reliability Engineering or DevOps
  • Expert knowledge of cloud platforms (AWS, GCP)
  • Expert knowledge of Kubernetes
  • Proven experience in designing and implementing scalable, efficient, resilient systems
  • Deep understanding of Linux/Unix systems and networking protocols
  • Strong programming skills in Python or Go
  • Strong background in monitoring, observability, and logging systems (e.g., Grafana, Prometheus, Loki)
  • Expertise in CI/CD tools (e.g., GitHub Actions, ArgoCD)
  • Excellent communication skills, both written and verbal, with the ability to explain complex technical concepts to various audiences
  • Experience in producing technical documentation, runbooks, presentations, and post-mortem reports
  • Experience in, and passion for, mentoring and upskilling team members

Responsibilities

  • Lead the implementation and refinement of SRE practices across the organization, including SLOs, error budgets, and blameless postmortems
  • Design and implement automation to eliminate toil and improve system reliability and efficiency
  • Lead initiatives and architect scalable hybrid cloud solutions for Web3 infrastructure
  • Manage error budgets and make data-driven decisions about when to prioritize reliability vs. new features
  • Drive SRE practices to ensure high availability, performance, and reliability under varying load conditions
  • Collaborate closely with the Platform Engineering team to build reliability into services from the ground up
  • Collaborate closely with Nethermind’s Infrastructure Leadership department to align SRE strategies with overall technical vision
  • Drive the adoption of observability best practices and implement comprehensive monitoring systems
  • Develop and maintain service level indicators (SLIs) and service level objectives (SLOs), working with product owners to define appropriate reliability targets
  • Mentor team members in SRE practices and foster a culture of continuous learning
  • Lead capacity planning efforts, using quantitative analysis to predict and address future scaling challenges
  • Contribute to long-term technical roadmaps, balancing reliability concerns with product innovation

Preferred Qualifications

  • Experience leading technical teams
  • Contributions to open-source projects or thought leadership in SRE
  • Familiarity with MLOps and big data technologies
  • Knowledge of blockchain technology and infrastructure
  • Experience with chaos engineering principles and tools
  • Familiarity with traffic management and CDN technologies
  • Systems or backend engineering background
