Senior Site Reliability Engineer

Runwise
Summary
Join Runwise, a fast-paced climate-tech startup, as a Senior Site Reliability Engineer (Sr. SRE). You will maintain the stability and performance of our services, ensuring reliability, scalability, and fault tolerance, and collaborate with engineers to build and maintain tools that improve system reliability and efficiency. Responsibilities include designing and maintaining scalable infrastructure, automating workflows, building monitoring systems, collaborating with development teams, participating in on-call rotations, defining SLOs/SLIs, conducting capacity planning, and advocating for engineering best practices. This role requires 5+ years of experience in SRE, DevOps, or infrastructure roles, proven success managing production systems in cloud environments, experience with infrastructure-as-code tools, strong scripting skills, and familiarity with CI/CD practices. Runwise offers a competitive salary, comprehensive benefits, and a hybrid work environment.
Requirements
- 5+ years of experience in Site Reliability Engineering, DevOps, or infrastructure-focused roles
- Proven success managing production systems in cloud environments like AWS, with a strong understanding of scalability and fault tolerance
- Experience using infrastructure-as-code tools like AWS CloudFormation and Ansible to manage and automate deployments
- Strong scripting or development skills in Python, Go, and Bash for building tools and automating workflows
- Hands-on experience with observability and alerting systems like Prometheus, Grafana, or CloudWatch
- Deep familiarity with CI/CD practices and tools, especially GitHub Actions, and a track record of improving build and release automation
- Comfort participating in on-call rotations and managing incident response, including postmortems and service recovery
- Ability to collaborate effectively across remote, distributed teams, with strong asynchronous communication and documentation habits
- A proactive mindset with a focus on continuous improvement, resilience, and customer impact
- Excitement about working in a fast-paced climate-tech company making a measurable environmental difference
Responsibilities
- Design and maintain scalable infrastructure across AWS and distributed on-prem systems
- Automate infrastructure provisioning, deployment pipelines, and operational workflows using tools like Terraform, Ansible, or Helm
- Build and improve monitoring, alerting, and observability systems (e.g., Cloud Health, Grafana)
- Collaborate with development teams to improve service reliability, performance, and scalability
- Participate in the on-call rotation and manage incident response, including root cause analysis and postmortems
- Define and track service-level objectives (SLOs) and service-level indicators (SLIs)
- Conduct capacity planning, chaos testing, and disaster recovery exercises
- Advocate for engineering best practices across CI/CD, security, and fault tolerance
Preferred Qualifications
- Additional experience with distributed IoC systems is a huge plus
Benefits
- Medical, dental, and vision insurance
- HSA & FSA options
- Paid Parental Leave
- Access to Talkspace & Health Advocate
- Flexible PTO
- Commuter Benefits
- 401(k)
- Company-paid life insurance
- Voluntary supplemental life insurance
- Free in-office lunch on Wednesdays
- Hybrid work environment
- Summer Fridays
- Monthly L&D Series
- Employee Resource Groups (e.g. DEIB Committee, Run Club)