Senior Site Reliability Engineer


Underdog Fantasy

💵 $150k-$180k
📍 Remote - United States

Summary

Join Underdog, the fastest-growing sports gaming company, and become a key member of our team. You will own and maintain incident response processes, guide teams in establishing SLOs, lead capacity planning, and develop disaster recovery plans. Collaboration with various teams on architecture decisions and launch planning is crucial. You will act as an internal expert for monitoring tools and infrastructure, emphasizing automation. This role requires 6+ years of experience in SRE, cloud infrastructure, or web application development, strong communication skills, and a collaborative nature. Underdog offers a competitive salary, unlimited PTO, parental leave, a home office allowance, and comprehensive benefits.

Requirements

  • 6+ years of experience in site reliability engineering, cloud infrastructure, and/or web application development
  • A strong written and verbal communicator
  • Collaborative by nature
  • Someone who enjoys using research, data, and experiments to make decisions; you believe “Hope is not a strategy.”
  • You enjoy working directly with customers (generally engineers or other people inside the company)
  • You think long-term about what is best for the business and its customers
  • You are excited to take ownership
  • You are very comfortable around an IDE, working with multiple languages, multiple web application frameworks, AWS services, Kubernetes, PostgreSQL
  • You can work independently to learn new languages/technologies as needed
  • You enjoy deploying changes to production quickly, multiple times a week if necessary

Responsibilities

  • Own and maintain the incident response process, including defining procedures, tools, and best practices
  • Guide teams in establishing and monitoring Service Level Objectives (SLOs), including setting up alerts and reporting systems
  • Lead capacity planning initiatives, focusing on both short and long-term scalability while optimizing costs
  • Develop and implement disaster recovery plans, including regular testing and regulatory compliance
  • Collaborate with teams on architecture decisions to ensure high availability and scalability
  • Manage launch and event planning for high-traffic occasions, focusing on infrastructure preparation and capacity management (a.k.a. Launch Readiness)
  • Act as an internal expert and consultant for monitoring tools like Datadog and PagerDuty and infrastructure like AWS and Kubernetes
  • Emphasize automation and tooling to scale our workload
  • Jump in and out of repos written in languages like Ruby, Python, Go, TypeScript, Swift, Kotlin, and SQL to support the efforts described above

Preferred Qualifications

  • Experience with PostgreSQL query optimization, tuning autovacuum settings, table statistics, different index types, etc.
  • Experience with Redis/Valkey optimization
  • Experience with Datadog or similar products
  • Experience working as a web application developer, frontend or backend, especially in React and Ruby on Rails
  • Experience with AWS cost optimization
  • Have read the Google SRE books or similar books, or have other forms of SRE training
  • Actively leverage AI to augment your abilities and gain knowledge about domains of interest

Benefits

  • Unlimited PTO (we're extremely flexible, except during the few weeks leading up to and into the start of the NFL season)
  • 16 weeks of fully paid parental leave
  • A $500 home office allowance
  • A connected, virtual-first culture with a highly engaged distributed workforce
  • 5% 401(k) match, FSA, and company-paid health, dental, and vision plan options for employees and dependents
