Senior Site Reliability Engineer at Supermetrics

Summary

Join Supermetrics' Infrastructure team as a Senior Site Reliability Engineer, working fully remotely from Brazil. This full-time role, contracted through a local EOR, requires fluency in English and involves a 2-3 week onboarding period at our Helsinki HQ. You will leverage your extensive software engineering experience and Kubernetes expertise to build and maintain critical infrastructure. Responsibilities include writing Terraform configurations, developing Golang tooling, managing Helm charts, responding to production incidents, and supporting the pre-sales team. The ideal candidate possesses a strong background in SRE, database operations, and cloud platforms (AWS/GCP).

Requirements

7+ years of experience in Site Reliability Engineering, Platform Engineering, or related roles
In-depth understanding of containers and experience operating Kubernetes clusters at scale
Experience operating databases in production
Proficient in database concepts with practical experience in both relational and NoSQL databases
In-depth knowledge of Linux systems and Terraform
In-depth experience and understanding of AWS and GCP
Solid understanding of modern observability practices and tools
Automation mindset with the ability to automate repetitive tasks using scripting languages such as Python or Bash
Collaborative approach to working with others
Willing to take on-call rotations during non-business hours
Good communication skills, particularly in writing (documentation as well as clear, well-structured PRs)
Skilled problem-solving abilities with a keen interest in the tools, technologies and problems in this space
A developer background and the ability to write CLIs and other tools in Go, Python, or Rust
Security mindset with experience implementing security best practices in platform and operational contexts
Experience in creating and managing Helm charts
Expert knowledge of continuous integration and continuous deployment (CI/CD) systems and processes and experience developing and maintaining GitHub Actions

Responsibilities

Write Terraform configuration and modules that bootstrap a Kubernetes cluster, or review PRs from other team members, making sure that our modules are truly reusable and well-defined and improving how we test and release them
Write, maintain, and improve our tooling (in Go, for example) so that engineering teams can use the platform easily
Develop and maintain Helm charts for internal deployments and third-party software
Respond to incidents in our production environment
Support our pre-sales team by helping them answer potential customers' questions about our architecture and how we guarantee data security, consistency, and uptime
Review an architecture change involving a new database and take part in the meetings discussing the pros and cons of such an approach
Rewrite a GitHub Action to improve how we deploy to Kubernetes using GitOps
Fix technical issues as they arise
Participate in our on-call rotations to provide support, respond to incidents, or handle internal users' questions

Preferred Qualifications

Experience as a Software Engineer with an extensive background in developing tools and services within a Platform Engineering team focused on building APIs, services, and CLIs to support development teams
Proficiency in Kubernetes, encompassing the enhancement of capabilities through Custom Resource Definitions (CRDs) and operators, alongside improving networking functions and expanding storage alternatives via Container Network Interfaces (CNIs) and Container Storage Interfaces (CSIs), while ensuring security with network policies, admission controls, security profiles, and runtime classes
Deep knowledge of observability practices, including advanced proficiency with PromQL, defining Service Level Indicators (SLIs) and Service Level Objectives (SLOs), and configuring and deploying OpenTelemetry collectors. Skilled in setting up and managing Grafana Dashboards, operating Time Series Databases (e.g., VictoriaMetrics), and working with Elastic/OpenSearch
Background in Site Reliability Engineering (SRE), with practical experience both as a user and as an operator. Capable of developing, supporting, and maintaining the tools, processes, and infrastructure essential for collecting, analyzing, and optimizing metrics, logs, and traces to ensure the high performance and scalability of SaaS platforms
