Staff Software Engineer, ML Training at Stack AV

Summary

Join Stack's ML Training Team and contribute to the development of revolutionary AI and autonomous systems for the trucking industry. The team focuses on optimizing model training speed and scalability, ensuring 100% GPU utilization across various GPU counts. Responsibilities include setting up efficiency monitoring, working with customer teams on benchmarking and improvements, creating standardized APIs, optimizing data loaders and training data formats, and optimizing distributed training configurations. The ideal candidate has 5+ years of experience as a software engineer, ideally with experience in infrastructure, customer-facing products, or autonomous vehicles. Strong skills in ML platforms and scalable infrastructure, along with a deep understanding of design tradeoffs, are essential. While not mandatory, experience with ML models in autonomous vehicles and with high-performance C++ is highly desirable.

Requirements

Experience: 5+ years as a SWE, ideally building infrastructure or customer-facing products; experience in AV or robotics is also great

Responsibilities

Set up efficiency monitoring for all our training jobs to identify models that need improvement
Work with customer teams to benchmark/profile their jobs and make improvements
Create standardized APIs for stack-wide abstractions like training datasets, bulk inference jobs, evaluation metrics
Optimize dataloaders / training data formats to ensure high GPU utilization
Optimize distributed training configurations (network topologies, sharding strategies, pipelines, etc.)

Preferred Qualifications

Experience with both ML platforms and building ML-based applications (bonus points if you have modeling experience)
Experience building scalable, reliable infra in a fast-paced environment
Experience building or using ML infra built for a large number of customer teams
A deep understanding of design tradeoffs, and the ability to articulate those tradeoffs and work with others to reach alignment
Experience with building ML models or ML infra in the domains of autonomous vehicles, perception, and decision making (desirable but not required)
Experience with model training, model optimization, or large data processing pipelines
Machine learning expertise is preferred but not necessary
Knows how to push the GPU to its limit from Python to CUDA kernel level
Built the inference or training loop for a large model (ideally with LLM flavor)
Shipped ML products (NLP, computer vision, recommender systems, etc.) at scale to make business impact
Knows how to build low latency / high throughput batch or stream processing pipelines
Knows how to write (readable) high performance C++
Prior AV experience
High customer empathy, able to communicate with customers well
Comfortable reading papers / keeping up with SOTA ML literature


Remote

Software Development

Senior
