LLM Data Engineer at Halo Media

Summary

Join our team as an experienced AI/LLM Data Engineer to build and maintain the data pipeline for our Generative AI platform. The ideal candidate will have a strong background in data engineering, with a focus on Retrieval-Augmented Generation (RAG) and knowledge-base techniques.

Requirements

Master's degree in Computer Science, Data Science, or a related field
3-5 years of work experience in data engineering, preferably in AI/ML contexts
Proficiency in Python, JSON, HTTP, and related tools
Strong understanding of LLM architectures, training processes, and data requirements
Experience with RAG systems, knowledge base construction, and vector databases
Familiarity with embedding techniques, similarity search algorithms, and information retrieval concepts
Hands-on experience with data cleaning, tagging, and annotation processes (both manual and automated)
Knowledge of data crawling techniques and associated ethical considerations
Strong problem-solving skills and ability to work in a fast-paced, innovative environment
Familiarity with Snowflake and its integration in AI/ML pipelines
Experience with various vector store technologies and their applications in AI
Understanding of data lakehouse concepts and architectures
Excellent communication and collaboration skills
Ability to translate business needs into technical solutions
Passion for innovation and a commitment to ethical AI development

Responsibilities

Design, implement, and maintain an end-to-end, multi-stage data pipeline for LLMs, including Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) data processes
Identify, evaluate, and integrate diverse data sources and domains to support the Generative AI platform
Develop and optimize data processing workflows for chunking, indexing, ingestion, and vectorization for both text and non-text data
Benchmark and implement various vector stores, embedding techniques, and retrieval methods
Create a flexible pipeline supporting multiple embedding algorithms, vector stores, and search types (e.g., vector search, hybrid search)
Implement and maintain auto-tagging systems and data preparation processes for LLMs
Develop tools for text and image data crawling, cleaning, and refinement
Collaborate with cross-functional teams to ensure data quality and relevance for AI/ML models
Work with data lakehouse architectures to optimize data storage and processing
Integrate and optimize workflows using Snowflake and various vector store technologies

Preferred Qualifications

Experience with popular LLM/RAG frameworks
Familiarity with distributed computing platforms (e.g., Apache Spark, Dask)
Knowledge of data versioning and experiment tracking tools
Experience with cloud platforms (AWS, GCP, or Azure) for large-scale data processing
Understanding of data privacy and security best practices
Practical experience implementing data lakehouse solutions
Proficiency in optimizing queries and data processes in Snowflake or Databricks
Hands-on experience with different vector store technologies

Benefits

Benefits package for US employees

Remote

Data

Mid-level
