Research Scientist at Deepgram

Summary

Join Deepgram, a leading voice AI platform, and pioneer the development of Latent Space Models (LSMs) to revolutionize voice AI. This research role focuses on building next-generation neural audio codecs, steerable generative models, and embedding systems to overcome challenges in data, scale, and cost. You will leverage latent recombination to generate synthetic audio data and train multimodal speech-to-speech systems. The ideal candidate possesses a strong mathematical foundation, deep expertise in foundation model architectures, and a proven ability to bridge theory and practice. This role demands an obsession with problem-solving, creativity, and a relentless drive toward elegant, scalable solutions. Deepgram offers a unique opportunity to make a transformative impact in the AI industry.

Requirements

Possess a strong mathematical foundation in statistical learning theory, particularly in areas relevant to self-supervised and multimodal learning
Have deep expertise in foundation model architectures, with an understanding of how to scale training across multiple modalities
Demonstrate the ability to bridge theory and practice: both deriving novel mathematical formulations and implementing them efficiently
Demonstrate the ability to build data pipelines that process and curate massive datasets while maintaining quality and diversity
Have a track record of designing controlled experiments that isolate the impact of architectural innovations and validate theoretical insights
Possess experience optimizing models for real-world deployment, including knowledge of hardware constraints and efficiency techniques
Have a history of open-source contributions or research publications that have advanced the state of the art in speech/language AI

Responsibilities

Build next-generation neural audio codecs that achieve extremely low bit-rate compression and high-fidelity reconstruction across a world-scale corpus of general audio
Pioneer steerable generative models that can synthesize the full diversity of human speech from the codec latent representation, from casual conversation to highly emotional expression to complex multi-speaker scenarios with environmental noise and overlapping speech
Develop embedding systems that cleanly factorize the codec latent space into interpretable dimensions of speaker, content, style, environment, and channel effects -- enabling precise control over each aspect and the ability to massively amplify an existing seed dataset through “latent recombination”
Leverage latent recombination to generate synthetic audio data at previously impossible scales, unlocking joint model and data scaling paradigms for audio
Endeavor to train multimodal speech-to-speech systems that can 1) understand any human irrespective of their demographics, state, or environment and 2) produce empathic, human-like responses that achieve conversational or task-oriented objectives
Design model architectures, training schemes, and inference algorithms that are adapted to hardware at the bare-metal level, enabling cost-efficient training on billion-hour datasets and powering real-time inference for hundreds of millions of concurrent conversations

Remote

All Others

Mid-level
