Site Reliability Engineer at Float.com

Summary

Join Float's growing SRE team as their third Site Reliability Engineer, working alongside the QA team to automate processes, improve visibility across engineering, and ensure reliability as the company scales. You will play a high-impact role in establishing stronger SLAs and enhancing customer experience. Responsibilities include maintaining and validating Kubernetes infrastructure upgrade processes, improving service hygiene by removing unnecessary alerts, partnering with engineers on service integration and migrations, and optimizing Kubernetes service usage. Further projects involve leading the exploration and implementation of service mesh options, defining incident response playbooks, supporting the next-gen data layer, and coaching teams on defining and meeting reliability goals. This fully remote role requires strong skills in Bash scripting, a programming language (ideally PHP, NodeJS, or Python), Kubernetes, Terraform, and GCP, along with excellent written communication and an iterative mindset. The company offers a competitive salary of US$133,000 (Level 2) and a supportive remote work environment.

Requirements

Confident writing scripts in Bash and proficient in at least one go-to language (ideally PHP, NodeJS, or Python)
Strong production experience managing and optimising Kubernetes clusters
Solid understanding of infrastructure as code using Terraform
Familiarity with Google Cloud Platform, or eagerness to get up to speed quickly
You believe in shipping value early and improving over time, not chasing one-shot perfection
You write clearly and concisely, whether it's documenting infrastructure, proposing changes, or sharing learnings across teams
You have previous remote experience and are comfortable using tools like Slack, Loom, and Linear to communicate as needed

Responsibilities

Maintain and validate the processes that keep our Kubernetes infrastructure up-to-date, ensuring upgrades happen smoothly, safely, and regularly
Remove noisy, unused, or misfiring boot alerts and improve the team's ability to trust alerts as meaningful signals
Partner with engineers to configure services within our clusters and support service migrations where possible
Review and optimise usage across Kubernetes services, including right-sizing scale node specifications
Lead our exploration and implementation of service mesh options and harden ingress layers to defend against spam and abuse
Define and roll out standardised playbooks to improve clarity and speed during production incidents
Build deep familiarity with our next-gen data layer (CDC) to support new teams building on top of it
Help teams define, measure, and meet reliability goals—enabling engineering to own quality into production and drive better outcomes for customers

Benefits

Pay for this role is US$133,000 (Level 2)
We’re a global async remote company with a diverse team of people from all over the world who share a common belief in living our best work life
We believe deeply in transparency and share our Float Handbook publicly so potential new team members can see firsthand our perks and benefits as well as our ways of working
Don’t worry—you will have significant deep work time since we have very few meetings

Remote

DevOps

Mid-level
