Senior Software Engineer, Site Reliability Engineering at Spreedly

Summary

Join Spreedly as a Senior Software Engineer, Site Reliability Engineering (SRE), and enhance the reliability, performance, and scalability of our globally distributed payments platform. You will focus on the application layer, collaborating with product and platform engineers to implement effective monitoring, resolve performance issues, and improve system resiliency. This role involves designing, implementing, and improving observability systems, leading root cause analysis and incident resolution, diagnosing and resolving application-level bottlenecks, optimizing databases, and collaborating with cross-functional teams. You will also build developer tools to automate processes and mentor other engineers. This is a high-impact role with significant visibility across the engineering organization.

Requirements

5+ years in SRE or related software engineering roles, with direct experience supporting production services at scale
Proficiency in a modern programming language (experience with Ruby, Rails, and Elixir is preferred)
Hands-on experience with observability tooling (Datadog, OpenTelemetry, Sentry, etc.)
Experience with AWS services, such as EC2 (Ubuntu Linux), S3, and RDS
Knowledge of relational databases (e.g., CockroachDB, PostgreSQL) and key-value stores (e.g., Riak), with experience in performance optimization and query tuning. Experience with Kafka is a plus
Experience supporting incident response and postmortems in high-stakes environments
Prior work developing and improving SLIs/SLOs and leading uptime initiatives in customer-facing systems
Understanding of software design patterns to support scalability and fault-tolerance
Experience mentoring other engineers and advocating for best practices
Background as an application-focused SRE who has worked on monoliths and complex service architectures

Responsibilities

Application Observability & Monitoring: Design, implement, and improve observability systems using Datadog, OpenTelemetry, and other tools to proactively detect and resolve system issues
Incident Management: Lead root cause analysis, incident resolution, and response rotation (~every 10–12 weeks), with a bias toward prevention and measurable reliability improvements
Performance Engineering: Diagnose and resolve application-level bottlenecks in Ruby on Rails and Elixir codebases, and partner with engineering teams to deliver SLIs/SLOs
Database Optimization: Identify and fix query and indexing inefficiencies in PostgreSQL and CockroachDB
Cross-Team Collaboration: Serve as a reliability partner to product and infrastructure teams, coaching on reliability principles and embedding SRE best practices
Tooling & Automation: Build developer tools to automate deployment, monitoring, and diagnostics across production systems

Benefits

Competitive salary + Equity
Outstanding Medical and Dental benefits, including 100% employer-paid options
Company-paid Life and Disability insurance
Optional vision and supplemental insurance options, and various Flexible Spending Accounts (FSA)
Open Paid Time Off policy + 12 weeks of paid leave for new parents
Matching 401(k) plan (5% up to $5,000 yearly)
$1,000 annual professional development stipend
Monthly home working/digital lifestyle stipend, new MacBook, and one-time accessory reimbursement
Access to company-paid professional coaching service
Visits to HQ in Durham, North Carolina for remote employees

Remote

DevOps

Senior
