Summary

Join RYZ Labs as a Cloud SRE and contribute to the development of self-driving robotic food delivery carriers. This remote position, based in Argentina or Uruguay, requires balancing hands-on SRE tasks with technical leadership. You will build and maintain critical tooling, guide architecture decisions, mentor colleagues, and drive initiatives to enhance system resilience. Collaboration with engineering, product, and operations teams is crucial to meet stringent uptime and performance goals. The role demands expertise in cloud technologies, SRE best practices, and strong communication skills. RYZ offers a dynamic environment with opportunities for growth and learning within a remote, distributed team.

Requirements

5+ years of experience in Site Reliability Engineering, DevOps, or a similar role
Demonstrated success implementing SRE best practices in high-availability, large-scale systems
Experience with one or more major cloud providers (e.g., Google Cloud, AWS, Azure); familiarity with managed services and best practices for high availability
Proficiency in Docker, Kubernetes, or similar containerization/orchestration platforms
Hands-on experience with logging, metrics, and tracing tools (e.g., Prometheus, Grafana, Datadog, Splunk, New Relic)
Familiarity with Infrastructure-as-Code (Terraform, Ansible, etc.) and scripting (Python, Go, Bash)
Proven ability to guide teams in adopting SRE principles without direct managerial authority
Excellent communication skills to work across diverse technical and business teams
Strong analytical skills to navigate complex systems and identify root causes
Comfortable operating in a fast-paced environment with shifting priorities

Responsibilities

Collaborate with development teams to design and implement solutions that ensure high availability in the cloud
Lead the definition and management of SLIs and SLOs aligned with business objectives
Perform capacity planning, load testing, and performance tuning
Develop monitoring and observability tools to validate system availability and performance
Implement best practices for instrumentation with tools like Prometheus, Grafana, or Datadog
Own the incident response process and root cause analysis
Identify and mitigate reliability risks to reduce downtime
Facilitate postmortems to capture learnings and drive continuous improvement
Advise teams on reliability-oriented design and development practices
Mentor engineers to foster a culture of continuous learning and operational excellence

Benefits

Remote work
Opportunities for growth and learning
Working with a team of great professionals and specialists

Cloud Site Reliability Engineer

Ryz Labs

Remote

DevOps

Mid-level
