Site Reliability Engineer, Technical Referent

dLocal
Summary
Join dLocal, a global payments company, and become a key member of our team building and maintaining observability pipelines. You will design, implement, and optimize data ingestion strategies for logs, metrics, and traces. This role involves empowering engineering teams with self-service tools, supporting incident management, collaborating across teams, and automating observability infrastructure. You will define observability standards, own technical and security health, and optimize alerting systems. dLocal offers a flexible, remote-first culture with various benefits, including remote work options, flexible schedules, a referral bonus program, learning and development opportunities, language classes, a social budget, and unique opportunities like dLocal Houses.
Requirements
- Over 4 years of experience as an SRE or in a similar role with a strong focus on observability
- Expertise in Kubernetes, including its core components, deployment methodologies, and monitoring best practices
- Some understanding of OpenTelemetry, including setting up OTEL collectors, instrumentation, and pipeline optimization
- Proficiency with monitoring and logging tools such as Grafana, Prometheus, Loki, New Relic, or Datadog
- Hands-on experience with IaC tools (Terraform) and GitOps CI/CD solutions (ArgoCD, GitHub Actions, or similar)
- Experience integrating incident management platforms (PagerDuty, Jira) with automated alerting workflows
- Strong scripting abilities (Python, Go, or similar) for automating observability tasks
- A problem-solving mindset, with the ability to collaborate across multi-functional teams to drive reliability improvements
Responsibilities
- Own OpenTelemetry Pipelines: Design, implement, and maintain observability pipelines across the three main signals—logs, metrics, and traces—ensuring standardized, scalable, and efficient data ingestion. Optimize ingestion strategies to balance cost, performance, and usability
- Empower Engineering Teams: Build self-service automation and tooling that enables development teams to instrument and leverage observability without requiring manual intervention from the SRE team. Drive adoption of best practices while ensuring teams own their telemetry
- Support Incident Management: Be the Engineering side of our Incident Management Team, designing the processes, playbooks, checklists, and automations for them and other engineers to follow during an incident
- Collaborate Across Teams: Interact with members from almost all teams across the business to understand their monitoring, alerting, and SLO/SLA requirements, and design systems and processes that ensure we meet or exceed these requirements. Influence architectural decisions during initial design stages to ensure resiliency and scale at the outset of software development
- Automate Observability Infrastructure: Leverage Infrastructure-as-Code (IaC) to provision and manage monitoring tools, alerting rules, and our observability configurations across OTEL Pipelines
- Define Baseline Observability Standards: Design base level requirements for new and existing services to ensure that all dLocal infrastructure and code are monitored consistently and accurately at a basic level
- Own Technical and Security Health: Take full ownership of dLocal’s infrastructure reliability, ensuring adherence to key availability and security KPIs
- Optimize Alerting Systems: Continuously refine alerting signals to minimize noise and ensure they are always actionable, reducing fatigue and improving response efficiency
Preferred Qualifications
- Cloud experience, especially AWS and ECS-based workloads
- Experience managing observability pipelines at scale in high-throughput environments
- Familiarity with Configuration-as-Code (Ansible, Chef, or SaltStack) for managing configurations across legacy instances
- Database performance monitoring experience, particularly in large-scale distributed environments
Benefits
- Remote work: work from anywhere or one of our offices around the globe!*
- Flexibility: we have flexible schedules and we are driven by performance
- Fintech industry: work in a dynamic and ever-evolving environment, with plenty to build and boost your creativity
- Referral bonus program: our internal talents are the best recruiters - refer someone ideal for a role and get rewarded
- Learning & development: get access to a Premium Coursera subscription
- Language classes: we provide free English, Spanish, or Portuguese classes
- Social budget: you'll get a monthly budget to chill out with your team (in person or remotely) and deepen your connections!
- dLocal Houses: want to rent a house to spend one week anywhere in the world coworking with your team? We’ve got your back!
