LLM Data Engineer

Halo Media

📍 Remote - United States

Job highlights

Summary

Join our team as an experienced AI/LLM Data Engineer to build and maintain the data pipeline for our Generative AI platform. The ideal candidate will have a strong background in data engineering, with a focus on Retrieval-Augmented Generation (RAG) and knowledge-base techniques.

Requirements

  • Master's degree in Computer Science, Data Science, or a related field
  • 3-5 years of work experience in data engineering, preferably in AI/ML contexts
  • Proficiency in Python, JSON, HTTP, and related tools
  • Strong understanding of LLM architectures, training processes, and data requirements
  • Experience with RAG systems, knowledge base construction, and vector databases
  • Familiarity with embedding techniques, similarity search algorithms, and information retrieval concepts
  • Hands-on experience with data cleaning, tagging, and annotation processes (both manual and automated)
  • Knowledge of data crawling techniques and associated ethical considerations
  • Strong problem-solving skills and ability to work in a fast-paced, innovative environment
  • Familiarity with Snowflake and its integration in AI/ML pipelines
  • Experience with various vector store technologies and their applications in AI
  • Understanding of data lakehouse concepts and architectures
  • Excellent communication and collaboration skills
  • Ability to translate business needs into technical solutions
  • Passion for innovation and a commitment to ethical AI development

Responsibilities

  • Design, implement, and maintain an end-to-end, multi-stage data pipeline for LLMs, including Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) data processes
  • Identify, evaluate, and integrate diverse data sources and domains to support the Generative AI platform
  • Develop and optimize data processing workflows for chunking, indexing, ingestion, and vectorization for both text and non-text data
  • Benchmark and implement various vector stores, embedding techniques, and retrieval methods
  • Create a flexible pipeline supporting multiple embedding algorithms, vector stores, and search types (e.g., vector search, hybrid search)
  • Implement and maintain auto-tagging systems and data preparation processes for LLMs
  • Develop tools for text and image data crawling, cleaning, and refinement
  • Collaborate with cross-functional teams to ensure data quality and relevance for AI/ML models
  • Work with data lakehouse architectures to optimize data storage and processing
  • Integrate and optimize workflows using Snowflake and various vector store technologies

Preferred Qualifications

  • Experience with popular LLM/RAG frameworks
  • Familiarity with distributed computing platforms (e.g., Apache Spark, Dask)
  • Knowledge of data versioning and experiment tracking tools
  • Experience with cloud platforms (AWS, GCP, or Azure) for large-scale data processing
  • Understanding of data privacy and security best practices
  • Practical experience implementing data lakehouse solutions
  • Proficiency in optimizing queries and data processes in Snowflake or Databricks
  • Hands-on experience with different vector store technologies

Benefits

US employee benefits package

Please let Halo Media know you found this job on JobsCollider. Thanks! 🙏