Site Reliability Engineer

Deel

📍 Remote

Summary

Join Deel's Production Sandbox Reliability team as a junior to mid-level Site Reliability Engineer, acting as the first line of defense for enterprise customer sandboxes. You will ensure these environments remain stable, up to date, and well monitored. This role involves hands-on infrastructure work and close collaboration with engineering and customer-facing teams. The ideal candidate thrives in high-ownership environments and wants to grow their SRE skills. Deel offers a dynamic, globally distributed team and a chance to make a meaningful impact on the future of work. The company boasts impressive growth and recognition, making it a career accelerator. Deel is committed to inclusivity and offers competitive compensation and benefits.

Requirements

  • 1-3 years of experience in SRE, DevOps or Infrastructure Engineering roles
  • Experience with Node.js or Go
  • Familiarity with AWS cloud services (EKS, S3, RDS)
  • Hands-on experience with Kubernetes, including Helm and ArgoCD
  • Experience with observability stacks: Datadog, Grafana, Mimir, Loki, Tempo, Zabbix
  • Strong verbal and written communication skills – able to interface effectively with both technical and non-technical stakeholders
  • Self-starter mindset with an eye for operational excellence and continuous improvement

Responsibilities

  • Maintain reliability: Monitor the health of enterprise customer sandbox environments and ensure high availability, uptime, and stability across all services
  • Stay up-to-date: Regularly roll out updates to microservices inside each sandbox to ensure alignment with the latest versions
  • Alert response & escalation: Triage infrastructure and application alerts, perform an initial investigation, and escalate incidents to the appropriate engineering teams with clear context
  • Improve observability: Enhance metrics, logs and tracing coverage using Datadog and the Grafana stack (Mimir, Loki, Tempo), identifying gaps and driving better alerting practices
  • Support incident workflows: Collaborate in post-incident reviews and ensure root cause analysis is followed up with actionable items and improvements by relevant teams
  • Communicate proactively: Act as the bridge between internal engineering teams and customer-facing teams, providing timely updates during incidents, maintenance and version upgrades
  • Participate in on-call rotation: Provide continuous coverage across APAC, EMEA and LATAM time zones as part of a rotating on-call schedule (follow the sun)

Benefits

  • Stock grant opportunities dependent on your role, employment status and location
  • Additional perks and benefits based on your employment status and country
  • The flexibility of remote work, including optional WeWork access
