Summary

Join Deel as a Site Reliability Engineer (SRE) and play a critical role in ensuring the reliability, scalability, and performance of our systems. You will combine software engineering practices with operations to manage, scale, and optimize production services, prioritizing resilience and user experience. Working closely with product and other teams, you will monitor production systems, establish SLAs, build automation tools, and handle incidents, directly supporting Deel's ability to innovate and maintain high-quality services for its global customers. This is a high-impact role at a rapidly growing company, offering significant career growth opportunities. The position requires experience in SRE or production operations, cloud experience with AWS, and familiarity with monitoring and alerting tools.

Requirements

1+ years of relevant Site Reliability Engineering or production operations experience
Basic cloud experience with AWS
Experience with monitoring and troubleshooting production environments
Knowledge of and experience with alerting and monitoring tools (experience with Datadog, Prometheus, Grafana, Loki, or Zabbix is an advantage)
Experience with Node.js
Results-driven, with a strong commitment to completing tasks

Responsibilities

Monitor production systems for performance and reliability issues, and respond to incidents swiftly (using tools like Datadog, Prometheus, Grafana, Loki, Zabbix)
Establish, track, and maintain Service Level Objectives (SLOs), Service Level Agreements (SLAs), and Service Level Indicators (SLIs) for critical services
Build and maintain automation scripts, tools, and dashboards to improve monitoring, alerting, and response times
Handle incidents, including:
Triage & Escalation: Quickly assess incidents, mitigate impact, and escalate when necessary
Root Cause Analysis: Conduct post-mortems and root cause analysis, working with the teams to implement corrective measures that prevent recurrence
Incident Documentation: Maintain thorough documentation for each incident, including timelines, impact, and resolution steps
Analytics: Provide weekly, monthly, and yearly analytics on incidents, problems, Mean Time To Repair (MTTR), and uptime metrics
Implement robust alerting and escalation protocols for production incidents, with escalation paths to minimize downtime. Validate alerts to ensure accurate thresholds before production use
Analyze data from production systems for bottlenecks, performance issues, and optimization opportunities, using insights from dev/staging to identify trends and preemptively address potential issues
Identify inefficiencies and bottlenecks in system processes and work to resolve them
Assist in building runbooks and recommend automated recovery procedures
Integrate lessons learned from incidents and team retrospectives into workflows and processes
Regularly collaborate with product teams to understand user needs, ensuring that improvements align with customer experience goals

Benefits

Stock grant opportunities dependent on your role, employment status and location
Additional perks and benefits based on your employment status and country
The flexibility of remote work, including optional WeWork access

Junior Site Reliability Engineer

Deel

Remote

DevOps

Entry Level
