Site Reliability Engineer

Float.com

💵 $133k
📍Remote - United States

Summary

Join Float's growing SRE team as their third Site Reliability Engineer, working alongside the QA team to automate processes, improve visibility across engineering, and ensure reliability as the company scales. You will play a high-impact role in establishing stronger SLAs and enhancing customer experience. Responsibilities include maintaining and validating Kubernetes infrastructure upgrade processes, improving service hygiene by removing unnecessary alerts, partnering with engineers on service integration and migrations, and optimizing Kubernetes service usage. Further projects involve leading the exploration and implementation of service mesh options, defining incident response playbooks, supporting the next-gen data layer, and coaching teams on defining and meeting reliability goals. This fully remote role requires strong skills in Bash scripting, a programming language (ideally PHP, NodeJS, or Python), Kubernetes, Terraform, and GCP, along with excellent written communication and an iterative mindset. The company offers a competitive salary of US$133,000 (Level 2) and a supportive remote work environment.

Requirements

  • Confident writing scripts in Bash and proficient in at least one go-to language (ideally PHP, NodeJS, or Python)
  • Strong production experience managing and optimising Kubernetes clusters
  • Solid understanding of infrastructure as code using Terraform
  • Familiarity with Google Cloud Platform, or eagerness to get up to speed quickly
  • You believe in shipping value early and improving over time, not chasing one-shot perfection
  • You write clearly and concisely, whether it's documenting infrastructure, proposing changes, or sharing learnings across teams
  • You have previous remote experience and are comfortable using tools like Slack, Loom, and Linear to communicate as needed

Responsibilities

  • Maintain and validate the processes that keep our Kubernetes infrastructure up-to-date, ensuring upgrades happen smoothly, safely, and regularly
  • Remove noisy, unused, or misfiring boot alerts and improve the team's ability to trust alerts as meaningful signals
  • Partner with engineers to configure services within our clusters and support service migrations where possible
  • Review and optimise usage across Kubernetes services, including right-sizing scale node specifications
  • Lead our exploration and implementation of service mesh options and harden ingress layers to defend against spam and abuse
  • Define and roll out standardised playbooks to improve clarity and speed during production incidents
  • Build deep familiarity with our next-gen data layer (CDC) to support new teams building on top of it
  • Help teams define, measure, and meet reliability goals—enabling engineering to own quality into production and drive better outcomes for customers

Benefits

  • Pay for this role is US$133,000 (Level 2)
  • We’re a global async remote company with a diverse team of people from all over the world who share a common belief in living our best work life
  • We believe deeply in the idea of transparency and share our Float Handbook publicly so potential new team members can see first hand our perks & benefits as well as our ways of working
  • Don’t worry—you will have significant deep work time since we have very few meetings
