Summary

Join GoFundMe as a Site Reliability Engineer (SRE) and be responsible for the full system lifecycle, from infrastructure provisioning to incident response. You will work with development teams, operations teams, and engineers to ensure high application performance and availability. The role involves designing and building cloud infrastructure (AWS), participating in performance analysis and capacity planning, and managing the platform's availability, scalability, security, and performance. You will diagnose bottlenecks, implement monitoring enhancements, and proactively improve infrastructure. On-call duties are required. The position is located in San Diego, CA, with an in-office requirement of 2-3 days per week.

Requirements

3+ years of experience operating high-traffic SaaS environments
Deep expertise in the mentality, processes, and tools needed to deliver high availability
Skills to build a fully automated, highly elastic cloud orchestration framework on AWS
Experience running containerized infrastructure in production (Kubernetes using EKS, AWS ECS)
Experience implementing configuration management and automation solutions using Infrastructure as Code, CI/CD, and GitOps (Ansible, Terraform, ArgoCD, GitHub Actions)
Strong working knowledge of Linux and its underlying components, system statistics, performance tuning, filesystems and IO
Solid scripting skills (e.g. Bash, Python)
Experience with performance diagnostics, performance tuning, capacity planning, and monitoring
BS in Computer Science or equivalent
Good verbal and written communication skills

Responsibilities

Design and build out our cloud infrastructure (we run everything in AWS)
Participate in software and system performance analysis, tuning, and service capacity planning
Manage the availability, scalability, security, and performance of our platform and applications
Diagnose bottlenecks across the full stack and recommend interim workarounds while long-term solutions are investigated
Periodically assess all monitoring requirements and implement enhancements to meet or exceed changing business needs
Proactively review, recommend, and implement changes to the live infrastructure after ensuring the right validation has been carried out
Work across engineering to improve our SLO/SLI framework
Use data analysis to pick up trends before they become major problems
Perform 24/7 on-call duties

Preferred Qualifications

Building PCI compliant systems
Working with infrastructure for payment processing systems
Developing high-volume transaction systems
Passion for building fault tolerant and secure platforms

Benefits

Competitive Benefits: Enjoy competitive pay and comprehensive healthcare benefits
Holistic Support: Enjoy financial assistance for things like hybrid work and family planning, along with generous parental leave, flexible time-off policies, and mental health and wellness resources to support your overall well-being
Growth Opportunities: Participate in learning, development, and recognition programs to help you thrive and grow

Senior Site Reliability Engineer

GoFundMe.org

Remote

DevOps

Senior
