Cloud Site Reliability Engineer

Ryz Labs

πŸ“Remote - Argentina

Summary

Join RYZ Labs as a Cloud SRE and contribute to the development of self-driving robotic food delivery carriers. This remote position, based in Argentina or Uruguay, requires balancing hands-on SRE tasks with technical leadership. You will build and maintain critical tooling, guide architecture decisions, mentor colleagues, and drive initiatives to enhance system resilience. Collaboration with engineering, product, and operations teams is crucial to meet stringent uptime and performance goals. The role demands expertise in cloud technologies, SRE best practices, and strong communication skills. RYZ offers a dynamic environment with opportunities for growth and learning within a remote, distributed team.

Requirements

  • 5+ years of experience in Site Reliability Engineering, DevOps, or a similar role
  • Demonstrated success implementing SRE best practices in high-availability, large-scale systems
  • Experience with one or more major cloud providers (e.g., Google Cloud, AWS, Azure); familiarity with managed services and best practices for high availability
  • Proficiency in Docker, Kubernetes, or similar containerization/orchestration platforms
  • Hands-on experience with logging, metrics, and tracing tools (e.g., Prometheus, Grafana, Datadog, Splunk, New Relic)
  • Familiarity with Infrastructure-as-Code (Terraform, Ansible, etc.) and scripting (Python, Go, Bash)
  • Proven ability to guide teams in adopting SRE principles without direct managerial authority
  • Excellent communication skills to work across diverse technical and business teams
  • Strong analytical skills to navigate complex systems and identify root causes
  • Comfortable operating in a fast-paced environment with shifting priorities

Responsibilities

  • Collaborate with development teams to design and implement solutions that ensure high availability in the cloud
  • Lead the definition and management of SLIs and SLOs aligned with business objectives
  • Perform capacity planning, load testing, and performance tuning
  • Develop monitoring and observability tools to validate system availability and performance
  • Implement best practices for instrumentation with tools like Prometheus, Grafana, or Datadog
  • Own the incident response process and root cause analysis
  • Identify and mitigate reliability risks to reduce downtime
  • Facilitate postmortems to capture learnings and drive continuous improvement
  • Advise teams on reliability-oriented design and development practices
  • Mentor engineers to foster a culture of continuous learning and operational excellence

Benefits

  • Remote work
  • Opportunities for growth and learning
  • Working with a team of great professionals and specialists
