Junior Site Reliability Engineer

Deel
Summary
Join Deel as a Site Reliability Engineer (SRE) and play a critical role in ensuring the reliability, scalability, and performance of our systems. You will combine software engineering practices with operations to manage, scale, and optimize production services, prioritizing resilience and user experience. Working closely with various teams, you will proactively monitor, analyze, and resolve issues, directly impacting Deel's ability to innovate and maintain high-quality services for its global customers. You will be responsible for monitoring production systems, establishing SLAs, building automation tools, handling incidents, and collaborating with product teams. This is a high-impact role at a rapidly growing company, offering significant career growth opportunities. Deel offers a competitive compensation and benefits package.
Requirements
- 1+ years of relevant Site Reliability Engineering or production operations experience
- Basic cloud experience with AWS
- Experience with monitoring and troubleshooting production environments
- Knowledge of and experience with alerting and monitoring tools (e.g. Datadog, Prometheus, Grafana, Loki, Zabbix) is an advantage
- Experience with Node.js
- Results-driven, with a strong commitment to completing tasks
Responsibilities
- Monitor production systems for performance and reliability issues, and respond to incidents swiftly (using tools like Datadog, Prometheus, Grafana, Loki, Zabbix)
- Establish, track, and maintain Service Level Objectives (SLOs), Service Level Agreements (SLAs), and Service Level Indicators (SLIs) for critical services
- Build and maintain automation scripts, tools, and dashboards to improve monitoring, alerting, and response times
- Handle incidents, including:
  - Triage & Escalation: Quickly assess incidents, mitigate impact, and escalate when necessary
  - Root Cause Analysis: Conduct post-mortems and root cause analysis, working with the teams to implement corrective measures that prevent recurrence
  - Incident Documentation: Maintain thorough documentation for each incident, including timelines, impact, and resolution steps
  - Analytics: Provide weekly, monthly, and yearly analytics for incidents, problems, Mean Time To Repair, and uptime metrics
- Implement robust alerting and escalation protocols for production incidents, with escalation paths to minimize downtime. Validate alerts to ensure accurate thresholds before production use
- Analyze data from production systems for bottlenecks, performance issues, and optimization opportunities, using insights from dev/staging to identify trends and preemptively address potential issues
- Identify inefficiencies and bottlenecks in system processes and work to resolve them
- Assist in building runbooks and recommend automated recovery procedures
- Integrate lessons learned from incidents and team retrospectives into workflows and processes
- Regularly collaborate with product teams to understand user needs, ensuring that improvements align with customer experience goals
Benefits
- Stock grant opportunities dependent on your role, employment status and location
- Additional perks and benefits based on your employment status and country
- The flexibility of remote work, including optional WeWork access