Staff Software Engineer, ML Training

Stack AV

📍 Remote - United States, Worldwide

Summary

Join Stack's ML Training Team and contribute to the development of revolutionary AI and autonomous systems for the trucking industry. The team focuses on optimizing model training speed and scalability, ensuring 100% GPU utilization across various GPU counts. Responsibilities include setting up efficiency monitoring, working with customer teams on benchmarking and improvements, creating standardized APIs, and optimizing data loaders, training data formats, and distributed training configurations. The ideal candidate has 5+ years of experience as a software engineer, ideally in infrastructure, customer-facing products, or autonomous vehicles. Strong skills in ML platforms and scalable infrastructure, along with a deep understanding of design tradeoffs, are essential. While not mandatory, experience with ML models for autonomous vehicles and with high-performance C++ is highly desirable.

Requirements

Experience: 5+ years as a software engineer, ideally building infrastructure or customer-facing products; experience in AV or robotics is a plus

Responsibilities

  • Set up efficiency monitoring for all our training jobs to identify models that need improvement
  • Work with customer teams to benchmark/profile their jobs and make improvements
  • Create standardized APIs for stack-wide abstractions like training datasets, bulk inference jobs, evaluation metrics
  • Optimize data loaders and training data formats to ensure high GPU utilization
  • Optimize distributed training configurations (network topologies, sharding strategies, pipelines, etc.)

Preferred Qualifications

  • Experience with both ML platforms and building ML-based applications (bonus points if you have modeling experience)
  • Experience building scalable, reliable infra in a fast-paced environment
  • Experience building or using ML infra built for a large number of customer teams
  • A deep understanding of design tradeoffs and ability to articulate those tradeoffs and work with others on getting alignment
  • Experience with building ML models or ML infra in the domains of autonomous vehicles, perception, and decision making (desirable but not required)
  • Experience with model training, model optimization, or large data processing pipelines
  • Machine learning expertise is preferred but not required
  • Knows how to push the GPU to its limit from Python to CUDA kernel level
  • Built the inference or training loop for a large model (ideally with LLM flavor)
  • Shipped ML products (NLP, computer vision, recommender systems, etc.) at scale to make business impact
  • Knows how to build low latency / high throughput batch or stream processing pipelines
  • Knows how to write (readable) high performance C++
  • Prior AV experience
  • High customer empathy, able to communicate with customers well
  • Comfortable reading papers / keeping up with SOTA ML literature
