Cloud Site Reliability Engineer
![Ryz Labs Logo](https://cdn.jobscollider.com/logo/ryzlabs-d804-0.webp)
Ryz Labs
Remote - Argentina
Summary
Join RYZ Labs as a Cloud SRE and contribute to the development of self-driving robotic food delivery carriers. This remote position, based in Argentina or Uruguay, requires balancing hands-on SRE tasks with technical leadership. You will build and maintain critical tooling, guide architecture decisions, mentor colleagues, and drive initiatives to enhance system resilience. Collaboration with engineering, product, and operations teams is crucial to meet stringent uptime and performance goals. The role demands expertise in cloud technologies, SRE best practices, and strong communication skills. RYZ offers a dynamic environment with opportunities for growth and learning within a remote, distributed team.
Requirements
- 5+ years of experience in Site Reliability Engineering, DevOps, or a similar role
- Demonstrated success implementing SRE best practices in high-availability, large-scale systems
- Experience with one or more major cloud providers (e.g., Google Cloud, AWS, Azure); familiarity with managed services and best practices for high availability
- Proficiency in Docker, Kubernetes, or similar containerization/orchestration platforms
- Hands-on experience with logging, metrics, and tracing tools (e.g., Prometheus, Grafana, Datadog, Splunk, New Relic)
- Familiarity with Infrastructure-as-Code (Terraform, Ansible, etc.) and scripting (Python, Go, Bash)
- Proven ability to guide teams in adopting SRE principles without direct managerial authority
- Excellent communication skills to work across diverse technical and business teams
- Strong analytical skills to navigate complex systems and identify root causes
- Comfortable operating in a fast-paced environment with shifting priorities
Responsibilities
- Collaborate with development teams to design and implement solutions that ensure high availability in the cloud
- Lead the definition and management of SLIs and SLOs aligned with business objectives
- Perform capacity planning, load testing, and performance tuning
- Develop monitoring and observability tools to validate system availability and performance
- Implement best practices for instrumentation with tools like Prometheus, Grafana, or Datadog
- Own the incident response process and root cause analysis
- Identify and mitigate reliability risks to reduce downtime
- Facilitate postmortems to capture learnings and drive continuous improvement
- Advise teams on reliability-oriented design and development practices
- Mentor engineers to foster a culture of continuous learning and operational excellence
Benefits
- Remote work
- Opportunities for growth and learning
- Working with a team of great professionals and specialists