Staff Software Engineer

Summary
Join Reddit's Machine Learning Platform team as a Staff Software Engineer and play a pivotal role in architecting, implementing, and maintaining the foundational Machine Learning Training infrastructure. You will optimize GPU utilization in AI/ML batch workloads, build systems for MLEs and data scientists, and continuously improve the ML software development lifecycle. This fully remote-friendly position requires 8+ years of experience in software development or building data systems, along with expertise in areas such as XLA/TorchInductor, distributed training optimization, large-scale ML systems, and MLOps. You will lead infrastructure development, propose high-performance solutions, analyze system bottlenecks, and mentor team members. The role offers a competitive salary, equity, and a comprehensive benefits package.
Requirements
- 8+ years of work experience in a production software development environment or building data systems
- Experience with XLA for TensorFlow or TorchInductor for PyTorch for kernel fusion during training
- Experience optimizing data workloads using Colossal-AI or DeepSpeed
- Experience with distributed training optimization using DeepSpeed, Horovod, or Colossal-AI
- Experience with design and architecture of large-scale ML systems
- Experience with training workflows, hyperparameter tuning, and resource optimization on CPU and GPU
- Experience with MLOps practices and tools such as Ray and MLflow
- Hands-on experience with Kubernetes, Docker, or other container orchestration systems
Responsibilities
- Optimize model training on GPUs
- Lead the building, testing, and maintenance of ML infrastructure at Reddit
- Propose, design, and implement high-performance ML platform solutions that significantly advance the deployment of models serving millions of redditors and provide a seamless experience for MLEs
- Play a pivotal role in designing, building, and optimizing the infrastructure and tooling required to support large-scale machine learning workflows
- Analyze bottlenecks in distributed systems and optimize for performance and cost-efficiency
- Work with management on team goal setting, planning, and de-risking project execution
- Mentor other team members in adopting a rigorous DevOps approach to maintain and improve the health and quality of ML platform components and services
Benefits
- Comprehensive Healthcare Benefits and Income Replacement Programs
- 401k Match
- Family Planning Support
- Gender-Affirming Care
- Mental Health & Coaching Benefits
- Flexible Vacation & Reddit Global Days Off
- Generous Paid Parental Leave
- Paid Volunteer Time Off
