Senior Software Engineer, Site Reliability Engineering

Spreedly
Summary
Join Spreedly as a Senior Software Engineer, Site Reliability Engineering (SRE), and enhance the reliability, performance, and scalability of our globally distributed payments platform. You will focus on the application layer, collaborating with product and platform engineers to implement effective monitoring, resolve performance issues, and improve system resiliency. This role involves designing, implementing, and improving observability systems, leading root cause analysis and incident resolution, diagnosing and resolving application-level bottlenecks, optimizing databases, and collaborating with cross-functional teams. You will also build developer tools to automate processes and mentor other engineers. This is a high-impact role with significant visibility across the engineering organization.
Requirements
- 5+ years in SRE or related software engineering roles, with direct experience supporting production services at scale
- Proficiency in a modern programming language (Ruby, Rails, and Elixir experience are preferred)
- Hands-on experience with observability tooling (Datadog, OpenTelemetry, Sentry, etc.)
- Experience with AWS services, such as EC2 (Ubuntu Linux), S3, and RDS
- Knowledge of relational and distributed databases (e.g., CockroachDB, PostgreSQL, Riak) with experience in performance optimization and query tuning. Experience with Kafka is a plus
- Experience supporting incident response and postmortems in high-stakes environments
- Prior work developing and improving SLIs/SLOs and leading uptime initiatives in customer-facing systems
- Understanding of software design patterns to support scalability and fault-tolerance
- Experience mentoring other engineers and advocating for best practices
- Application-focused SRE background, with experience working on monoliths and complex service architectures
Responsibilities
- Application Observability & Monitoring: Design, implement, and improve observability systems using Datadog, OpenTelemetry, and other tools to proactively detect and resolve system issues
- Incident Management: Lead root cause analysis, incident resolution, and response rotation (~every 10–12 weeks), with a bias toward prevention and measurable reliability improvements
- Performance Engineering: Diagnose and resolve application-level bottlenecks in Ruby on Rails and Elixir codebases, and partner with engineering teams to deliver SLIs/SLOs
- Database Optimization: Identify and fix query and indexing inefficiencies in PostgreSQL and CockroachDB
- Cross-Team Collaboration: Serve as a reliability partner to product and infrastructure teams, coaching on reliability principles and embedding SRE best practices
- Tooling & Automation: Build developer tools to automate deployment, monitoring, and diagnostics across production systems
Benefits
- Competitive salary + Equity
- Outstanding Medical and Dental benefits, including 100% employer-paid options
- Company-paid Life and Disability insurance
- Optional vision and supplemental insurance options, and various Flexible Spending Accounts (FSA)
- Open Paid Time Off policy + 12 weeks of paid leave for new parents
- Matching 401(k) plan (5% up to $5,000 yearly)
- $1,000 annual professional development stipend
- Monthly home working/digital lifestyle stipend, new MacBook, and one-time accessory reimbursement
- Access to company-paid professional coaching service
- Visits to HQ in Durham, North Carolina for remote employees
