Manager - SRE

SigFig

πŸ“Remote - India

Summary

Join SigFig's Infrastructure & DevOps team as Manager, SRE and lead a hands-on technical team supporting mission-critical systems. This role focuses on proactively improving system resilience, driving automation, and strengthening incident response. You will collaborate with engineering, SRE, and security teams to streamline deployment processes and ensure high availability of services. The position requires managing and scaling infrastructure with tools such as Terraform, Kubernetes, and Puppet, and acting as the first technical escalation point for production incidents. You will lead post-incident reviews and contribute to a growing incident knowledge base. SigFig offers competitive benefits including flexible PTO, wellness benefits, and a remote-first work environment.

Requirements

  • 7+ years of experience in SRE, DevOps, or Technical Operations roles
  • 2+ years in a leadership role managing global, distributed teams in a high-uptime environment
  • Proven experience with AWS, GCP, or Azure, and implementing infrastructure as code at scale
  • Strong scripting skills in Python, Bash, or similar for automation and operational tooling
  • Deep understanding of observability and incident management best practices
  • Experience with CI/CD and deployment orchestration tools
  • Familiarity with containerized and microservices-based architectures
  • Passion for automation, reliability engineering, and continuous improvement
  • Excellent communication and leadership skills to coordinate across global teams

Responsibilities

  • Lead a global, distributed SRE/DevOps team operating in a 24/7 production environment
  • Develop and implement automation frameworks for self-healing, auto-remediation, and system optimization
  • Enhance monitoring and observability through tools like Splunk, Prometheus, and AI-powered alerting platforms
  • Improve CI/CD pipelines using Jenkins, GitHub Actions, and ArgoCD, and drive continuous delivery at scale
  • Manage and scale infrastructure using Terraform, Kubernetes, Puppet, and similar tools
  • Act as the first technical escalation point for Level 2/Level 3 troubleshooting of production incidents involving Linux servers, cloud networking, and Kubernetes clusters
  • Lead post-incident reviews, implement automated solutions for root cause issues, and contribute to a growing incident knowledge base
  • Collaborate cross-functionally with Engineering, Security, and Product to align reliability initiatives with business objectives
  • Establish and enforce SLOs and error budgets to continually raise system reliability standards

Preferred Qualifications

  • Previous experience in fintech or highly regulated environments is a plus

Benefits

  • Flexible PTO
  • Tax-friendly Compensation
  • Liberal Leave Policy
  • Medical cover for the family, including parents
  • Quarterly Wellness Benefit
  • WFH Allowance
  • Mobile/Internet subsidy (for smooth WFH experience)
  • Employee Referral Program
  • Employee Recognition Program
