Senior Site Reliability Engineer

dLocal
Summary
Join dLocal, a global payments company, as a Site Reliability Engineer (SRE) focused on building and maintaining a centralized observability platform using OpenTelemetry. You will design, implement, and optimize data ingestion pipelines, empower engineering teams with self-service tools, support incident management, collaborate across teams, automate infrastructure, and define observability standards. This role requires extensive experience with Kubernetes, monitoring tools, IaC, and scripting languages. dLocal offers a remote-first, flexible work environment with various benefits including remote work options, flexible schedules, a referral bonus program, learning and development opportunities, language classes, a social budget, and dLocal Houses.
Requirements
- Over 4 years of experience as a Site Reliability Engineer or in a similar role with a strong focus on observability
- Expertise in Kubernetes, including its core components, deployment methodologies, and monitoring best practices
- Some understanding of OpenTelemetry, including setting up OTEL collectors, instrumentation, and pipeline optimization
- Proficiency with monitoring and logging tools such as Grafana, Prometheus, Loki, New Relic, or Datadog
- Hands-on experience with IaC tools (Terraform) and GitOps CI/CD solutions (ArgoCD, GitHub Actions, or similar)
- Experience integrating incident management platforms (PagerDuty, Jira) with automated alerting workflows
- Strong scripting abilities (Python, Go, or similar) for automating observability tasks
- A problem-solving mindset, with the ability to collaborate across cross-functional teams to drive reliability improvements
Responsibilities
- Own OpenTelemetry Pipelines: Design, implement, and maintain observability pipelines across the three main signals—logs, metrics, and traces—ensuring standardized, scalable, and efficient data ingestion. Optimize ingestion strategies to balance cost, performance, and usability
- Empower Engineering Teams: Build self-service automation and tooling that enables development teams to instrument and leverage observability without requiring manual intervention from the SRE team. Drive adoption of best practices while ensuring teams own their telemetry
- Support Incident Management: Act as the engineering side of our Incident Management Team, designing the processes, playbooks, checklists, and automations that the team and other engineers follow during an incident
- Collaborate Across Teams: Work with members of almost every team across the business to understand their monitoring, alerting, and SLO/SLA requirements, and design systems and processes that ensure we meet or exceed them. Influence architectural decisions during the initial design stages so that resiliency and scale are built in from the outset of software development
- Automate Observability Infrastructure: Leverage Infrastructure-as-Code (IaC) to provision and manage monitoring tools, alerting rules, and observability configurations across our OpenTelemetry pipelines
- Define Baseline Observability Standards: Design baseline requirements for new and existing services so that all dLocal infrastructure and code are monitored consistently and accurately
- Own Technical and Security Health: Take full ownership of dLocal’s infrastructure reliability, ensuring adherence to key availability and security KPIs
- Optimize Alerting Systems: Continuously refine alerting signals to minimize noise and ensure they are always actionable, reducing alert fatigue and improving response efficiency
Preferred Qualifications
- Cloud experience, especially AWS and ECS-based workloads
- Experience managing observability pipelines at scale in high-throughput environments
- Familiarity with Configuration-as-Code (Ansible, Chef, or SaltStack) for managing configurations across legacy instances
- Database performance monitoring experience, particularly in large-scale distributed environments
Benefits
- Remote work: work from anywhere or one of our offices around the globe!*
- Flexibility: we have flexible schedules and we are driven by performance
- Fintech industry: work in a dynamic and ever-evolving environment, with plenty to build and boost your creativity
- Referral bonus program: our people are the best recruiters - refer someone ideal for a role and get rewarded
- Learning & development: get access to a Premium Coursera subscription
- Language classes: we provide free English, Spanish, or Portuguese classes
- Social budget: you'll get a monthly budget to chill out with your team (in person or remotely) and deepen your connections!
- dLocal Houses: want to rent a house to spend one week anywhere in the world coworking with your team? We’ve got your back!