Staff Software Engineer

Reddit Logo

Reddit

πŸ’΅ $230k-$322k
πŸ“Remote - United States

Summary

Join Reddit's Machine Learning Platform team as a Staff Software Engineer and play a pivotal role in architecting, implementing, and maintaining the foundational Machine Learning Training infrastructure. You will optimize GPU utilization in AI/ML batch workloads, build systems for MLEs and data scientists, and continuously improve the ML software development lifecycle. This completely remote-friendly position requires 8+ years of experience in software development or building data systems and expertise in areas such as XLA/torch.inductor, distributed training optimization, large-scale ML systems, and MLOps. You will lead infrastructure development, propose high-performance solutions, analyze system bottlenecks, and mentor team members. The role offers a competitive salary, equity, and a comprehensive benefits package.

Requirements

  • 8+ years of work experience in a production software development environment or building data systems
  • Experience with XLA for Tensorflow or torch.inductor for pytorch for kernel fusion during training
  • Experience with optimization of data workloads using collosal.AI or Deepspeed
  • Experience with distributed Training optimization using deepspeed, horovod or collosalAI
  • Experience with design and architecture of large scale ML Systems
  • Experience with training workflows, hyperparameter tuning, and resource optimization on CPU and GPU
  • Experience with MLOps practices and tools such as Ray and MLFlow
  • Hands-on experience with Kubernetes, Docker, or other container orchestration systems

Responsibilities

  • Optimize model training on GPUs
  • Lead the building, testing, and maintenance of ML infrastructure at Reddit
  • Propose, design, and implement high-performance ML platform solutions that significantly advance the deployment of models that serve millions of redditors a seamless experience for MLEs
  • Play a pivotal role in designing, building, and optimizing the infrastructure and tooling required to support large-scale machine learning workflows
  • Analyze bottlenecks in distributed systems and optimize for performance and cost-efficiency
  • Work with management on team goal setting, planning, and de-risk project execution
  • Mentor other team members in adopting a rigorous DevOps approach to maintain and/or improve ML platform components and services health and quality

Benefits

  • Comprehensive Healthcare Benefits and Income Replacement Programs
  • 401k Match
  • Family Planning Support
  • Gender-Affirming Care
  • Mental Health & Coaching Benefits
  • Flexible Vacation & Reddit Global Days off
  • Generous paid Parental Leave
  • Paid Volunteer time off

Share this job:

Disclaimer: Please check that the job is real before you apply. Applying might take you to another website that we don't own. Please be aware that any actions taken during the application process are solely your responsibility, and we bear no responsibility for any outcomes.