Summary
Join Fastly's Technical Operations team as a Senior Site Reliability Engineer and play a key role in ensuring the reliability, performance, and scalability of our infrastructure. You will develop and enhance automation, refine monitoring, champion platform stability, and drive continuous improvement. This role requires expertise in Linux/Unix systems, proficiency in Go or Python, and experience with large-scale infrastructure. The position is open to hybrid or remote work within the US, with preferred locations in San Francisco, New York, and Denver. It comes with a competitive salary, comprehensive benefits, and opportunities for professional growth.
Requirements
- Expertise in Linux/Unix systems, with hands-on experience tuning, troubleshooting, and operating them at scale
- Proficiency in software development using Go or Python, including experience writing robust, maintainable, and efficient code for infrastructure automation
- Experience operating large-scale infrastructure in on-prem, cloud, or hybrid environments, with a focus on reliability, scalability, and automation
- End-to-end system knowledge, from design and provisioning to deployment, monitoring, and long-term operations
- Strong networking fundamentals, including TCP/IP, DNS, HTTP, and TLS, with practical experience debugging network issues
- Proven ability to manage and scale highly available, distributed systems, or the demonstrated capability to quickly ramp up in such environments
- Motivation to learn and adapt to a new tech stack, approaching unforeseen challenges with a problem-solving mindset
Responsibilities
- Develop and enhance automation and tooling to reduce manual toil across the Fastly fleet
- Refine monitoring and alarming to ensure focus on key metrics for optimal performance and customer stability
- Champion platform stability and customer reliability through cross-team collaboration
- Drive continuous improvement by learning from operational challenges and integrating insights into future plans
- Address large-scale challenges and optimize systems for efficiency and performance
Preferred Qualifications
- Experience with BGP and network routing in large-scale environments or deep knowledge of the Linux networking stack
- Hands-on Kubernetes experience or expertise with other container orchestration platforms in production environments
Benefits
- Medical, dental, and vision insurance
- Family planning and mental health support, along with an Employee Assistance Program
- Insurance (Life, Disability, and Accident)
- A Flexible Vacation policy and up to 18 days of accrued paid sick leave
- 401(k) (including company match)
- An Employee Stock Purchase Program
- 11 paid local holidays
- 11 paid company wellness days