Summary
Join GoFundMe as a Site Reliability Engineer (SRE) and be responsible for the full system lifecycle, from infrastructure provisioning to incident response. You will work with development teams, operations teams, and engineers to ensure high application performance and availability. The role involves designing and building cloud infrastructure (AWS), participating in performance analysis and capacity planning, and managing the platform's availability, scalability, security, and performance. You will diagnose bottlenecks, implement monitoring enhancements, and proactively improve infrastructure. On-call duties are required. The position is located in San Diego, CA, with an in-office requirement of 2-3 days per week.
Requirements
- 3+ years of experience operating high-traffic SaaS environments
- Deep expertise in the mentality, processes, and tools needed to deliver high availability
- Skills to build a fully automated, highly elastic cloud orchestration framework on AWS
- Experience running containerized infrastructure in production (Kubernetes using EKS, AWS ECS)
- Experience implementing configuration management and automation solutions using Infrastructure as Code, CI/CD, and GitOps (Ansible, Terraform, ArgoCD, GitHub Actions)
- Strong working knowledge of Linux and its underlying components, system statistics, performance tuning, filesystems and IO
- Solid scripting skills (e.g. Bash, Python)
- Experience with performance diagnostics, performance tuning, capacity planning, and monitoring
- BS in Computer Science or equivalent
- Good verbal and written communication skills
Responsibilities
- Design and build out our cloud infrastructure (we run everything in AWS)
- Participate in software and system performance analysis, tuning, and service capacity planning
- Manage the availability, scalability, security, and performance of our platform and applications
- Diagnose bottlenecks across the full stack and recommend interim workarounds while long-term solutions are investigated
- Periodically assess all monitoring requirements and implement enhancements to meet or exceed changing business needs
- Proactively review, recommend, and implement changes to the live infrastructure once they have been properly validated
- Work across engineering to improve the SLO/SLI framework
- Use data analysis to pick up trends before they become major problems
- Perform 24/7 on-call duties
Preferred Qualifications
- Building PCI-compliant systems
- Working with infrastructure for payment processing systems
- Developing high-volume transaction systems
- Passion for building fault-tolerant and secure platforms
Benefits
- Competitive Benefits: Enjoy competitive pay and comprehensive healthcare benefits
- Holistic Support: Enjoy financial assistance for things like hybrid work and family planning, along with generous parental leave, flexible time-off policies, and mental health and wellness resources to support your overall well-being
- Growth Opportunities: Participate in learning, development, and recognition programs to help you thrive and grow