Remote Site Reliability Engineer Technical Lead
Nethermind (closed)
Remote - EU
Job highlights
Summary
Join a team of builders and researchers on a mission to empower enterprises and developers worldwide to access and build on decentralized systems. We're seeking an experienced Site Reliability Engineer to lead and mentor our SRE team.
Requirements
- 5+ years of experience in Site Reliability Engineering or DevOps
- Expert knowledge of cloud platforms (AWS, GCP)
- Expert knowledge of Kubernetes
- Proven experience in designing and implementing scalable, efficient, resilient systems
- Deep understanding of Linux/Unix systems and networking protocols
- Strong programming skills in Python or Go
- Strong background in monitoring, observability, and logging systems (e.g., Grafana, Prometheus, Loki)
- Expertise in CI/CD tools (e.g., GitHub Actions, ArgoCD)
- Excellent communication skills, both written and verbal, with the ability to explain complex technical concepts to various audiences
- Experience in producing technical documentation, runbooks, presentations, and post-mortem reports
- Experience in, and passion for, mentoring and upskilling team members
Responsibilities
- Lead the implementation and refinement of SRE practices across the organization, including SLOs, error budgets, and blameless postmortems
- Design and implement automation to eliminate toil and improve system reliability and efficiency
- Lead initiatives and architect scalable hybrid cloud solutions for Web3 infrastructure
- Manage error budgets and make data-driven decisions about when to prioritize reliability vs. new features
- Drive SRE practices to ensure high availability, performance, and reliability under varying load conditions
- Collaborate closely with the Platform Engineering team to build reliability into services from the ground up
- Collaborate closely with Nethermind's Infrastructure Leadership department to align SRE strategies with overall technical vision
- Drive the adoption of observability best practices and implement comprehensive monitoring systems
- Develop and maintain service level indicators (SLIs) and objectives (SLOs), working with product owners to define appropriate reliability targets
- Mentor team members in SRE practices and foster a culture of continuous learning
- Lead capacity planning efforts, using quantitative analysis to predict and address future scaling challenges
- Contribute to long-term technical roadmaps, balancing reliability concerns with product innovation
Preferred Qualifications
- Experience leading technical teams
- Contributions to open-source projects or thought leadership in SRE
- Familiarity with MLOps and big data technologies
- Knowledge of blockchain technology and infrastructure
- Experience with chaos engineering principles and tools
- Familiarity with traffic management and CDN technologies
- Systems or backend engineering background
This job is filled or no longer available