Site Reliability Engineering Leader

dLocal

📍 Remote - Argentina

Job highlights

Summary

Join dLocal, a global payments company, as a Site Reliability Engineer (SRE). You will design and implement highly resilient, scalable, and reliable systems for mission-critical applications used by major clients. The role involves developing quality gates, automating processes, influencing architectural decisions, and collaborating with teams across the business. You will work with monitoring tools, CI/CD pipelines, and security best practices. dLocal offers a flexible, remote-first culture with travel, health, and learning benefits.

Requirements

  • Over 3 years of experience as an SRE or in a very similar role
  • Experience with monitoring tools such as New Relic, Datadog, Nagios
  • Experience working with tools such as Jira, PagerDuty and Confluence, and integrating them into automated workflows through their APIs
  • Experience with CI/CD tools such as GitHub Actions, Jenkins, Spinnaker, Argo CD or similar
  • Knowledge of security best practices and infosec tooling. (You will be writing systems to monitor for breaches and security weaknesses.)
  • Strong communication skills
  • Problem-solving skills
  • Detail-oriented person
  • Highly analytical person
  • Ability to collaborate across multi-functional teams

Responsibilities

  • Develop quality gates based on production-level service level objectives (SLOs) to detect issues earlier in the development cycle
  • Automate build testing and validation using service-level indicators (SLIs) and SLOs
  • Influence architectural decisions during initial design stages to ensure resiliency and scale at the outset of software development
  • Design processes, playbooks and checklists for other engineers to follow during and after incidents
  • Write postmortems and perform technical after-action reviews to understand root causes and propose system improvements that reduce overall fault rates
  • Interact with members from almost all teams across the business to understand their monitoring, alerting and SLO / SLA requirements and design systems and processes that ensure we meet or exceed these requirements
  • Automate the provisioning of monitoring tools and rules with tools like Terraform and Ansible / Chef
  • Design base level requirements for new and existing services to ensure that all dLocal infrastructure and code are monitored consistently and accurately at a basic level
  • Monitor both the technical health and the security health of dLocal infrastructure and systems
  • Optimize signal-to-noise ratio for alerting to ensure we receive only the alerts that are actionable and make sense

Preferred Qualifications

  • Cloud experience (AWS) is highly advantageous (as most systems will integrate with AWS at some level)
  • IaC experience with a tool like Terraform is highly advantageous
  • CaC experience with a tool like Ansible, Chef or Salt is highly advantageous
  • Database knowledge is highly advantageous (both in terms of how they perform and SQL syntax)

Benefits

Flexible, remote-first dynamic culture with travel, health, and learning benefits
