Senior Site Reliability Engineer
Spreedly
Job highlights
Summary
Join Spreedly as a Senior Site Reliability Engineer and ensure the reliability, observability, and scalability of our globally distributed payments platform. You will lead efforts to stabilize and optimize our infrastructure, build platform services, and champion best practices. Leverage your expertise in software development, infrastructure, and operations to ensure our applications and systems are reliable, scalable, and efficient. Work across the entire application stack, using a diverse range of tools and technologies to support our mission-critical system. This role requires strong experience in designing and operating highly available, scalable cloud architectures. You will also mentor team members and foster a culture of learning and collaboration.
Requirements
- Hands-on experience with Datadog, OpenTelemetry, Sentry, and Sumo Logic or similar monitoring and observability platforms, with a focus on actionable metrics and alerts
- Strong proficiency in a modern programming language, with a proven ability to write clean, maintainable, and efficient code
- Extensive experience with AWS services, including EC2 (Ubuntu Linux), S3, and RDS
- In-depth knowledge of databases, both relational (e.g., CockroachDB, PostgreSQL) and key-value (e.g., Riak), with experience in performance optimization and query tuning
- Excellent problem-solving skills with experience diagnosing complex system issues in production environments
- Proven ability to work cross-functionally with product, application, infrastructure, and security engineering teams
- Strong understanding of DevOps practices, including CI/CD pipelines, configuration management, and infrastructure-as-code
- Strong written and verbal communication skills, with the ability to explain complex technical concepts to non-technical stakeholders
Responsibilities
- Ensure the reliability, availability, and performance of the production systems behind Spreedly's globally distributed payments platform, which processes $4B monthly, through monitoring, automation, and continuous improvement
- Collaborate with development teams to improve the reliability and performance of Ruby on Rails and Elixir applications
- Implement and maintain robust observability solutions using Datadog and OpenTelemetry, enabling proactive identification, alerting, and resolution of issues
- Lead incident response efforts by participating in a shared on-call rotation to maintain 24/7 system reliability, including root cause analysis, resolution, and implementing measures to prevent recurrence
- Develop and maintain automation tools to reduce manual intervention, streamline operations, and enhance developer productivity
- Monitor, analyze, and optimize the performance of relational databases, identifying and resolving bottlenecks to maintain data integrity and efficiency
- Lead by example, infusing modern SRE best practices and fostering a culture of reliability and performance within the engineering organization
- Provide technical guidance and mentorship to team members, fostering a culture of learning and collaboration
Preferred Qualifications
- Experience with Ruby, Rails, and Elixir is preferred
- Experience with Kafka is a plus
- Advanced knowledge of Docker and container orchestration best practices is a plus
Benefits
- Competitive salary + Equity
- Outstanding Medical and Dental benefits, including 100% employer-paid options
- Company-paid Life and Disability insurance
- Optional vision and supplemental insurance options, and various Flexible Spending Accounts (FSA)
- Open Paid Time Off policy + 12 weeks of paid leave for new parents
- Matching 401(k) plan (5% up to $5,000 yearly)
- Monthly home working/digital lifestyle stipend, new MacBook, and one-time accessory reimbursement
- LinkedIn Learning subscription
- Access to company-paid professional coaching service
- Visits to HQ in Durham, North Carolina for remote employees