Data Engineer

Kyivstar
Summary
Join Kyivstar.Tech, a Ukrainian IT company, as a Data Engineer (NLP-focused) to build and optimize data pipelines for its Ukrainian LLM and Kyivstar’s wider NLP initiatives. You will design robust ETL/ELT processes for managing large-scale text and metadata, enabling data scientists and ML engineers to develop cutting-edge language models. Working at the intersection of data engineering and machine learning, you will ensure reliable, scalable datasets and infrastructure tailored to NLP model training and evaluation in a Ukrainian context. The role offers a unique opportunity to shape the data foundation of a pioneering AI project in Ukraine, collaborating with NLP experts and using modern big data technologies. Day to day, you will design, develop, and maintain data pipelines, implement data processing workflows, and work closely with data scientists and NLP engineers. The position requires strong Python programming skills, experience with NLP packages, and familiarity with cloud platforms.
Requirements
- 3+ years of experience as a Data Engineer or in a similar role, building data-intensive pipelines or platforms
- A Bachelor’s or Master’s degree in Computer Science, Engineering, or related field is preferred
- Experience supporting machine learning or analytics teams with data pipelines is a strong advantage
- Prior experience handling linguistic data or supporting NLP projects (e.g., text normalization, handling different encodings, tokenization strategies)
- Knowledge of Ukrainian text sources and data sets, or experience with multilingual data processing, can be an advantage given our project’s focus
- Understanding of FineWeb2 or similar data processing pipeline approaches
- Hands-on experience designing ETL/ELT processes, including extracting data from various sources, using transformation tools, and loading into storage systems
- Proficiency with orchestration frameworks like Apache Airflow for scheduling workflows
- Familiarity with building pipelines for unstructured data (text, logs) as well as structured data
- Strong programming skills in Python for data manipulation and pipeline development
- Experience with NLP packages (spaCy, NLTK, langdetect, fasttext, etc.)
- Experience with SQL for querying and transforming data in relational databases
- Knowledge of Bash or other scripting for automation tasks
- Writing clean, maintainable code and using version control (Git) for collaborative development
- Experience working with relational databases (e.g., PostgreSQL, MySQL) including schema design and query optimization
- Familiarity with NoSQL or document stores (e.g., MongoDB) and big data technologies (HDFS, Hive, Spark) for large-scale data is a plus
- Understanding of or experience with vector databases (e.g., Pinecone, FAISS) is beneficial, as our NLP applications may require embedding storage and fast similarity search
- Practical experience with cloud platforms (AWS, GCP, or Azure) for data storage and processing
- Ability to set up services such as S3/Cloud Storage and data warehouses (e.g., BigQuery, Redshift), and to use cloud-based ETL tools or serverless functions
- Understanding of infrastructure-as-code (Terraform, CloudFormation) to manage resources is a plus
- Knowledge of data quality assurance practices
- Experience implementing monitoring for data pipelines (logs, alerts) and using CI/CD tools to automate pipeline deployment and testing
- An analytical mindset to troubleshoot data discrepancies and optimize performance bottlenecks
- Ability to work closely with data scientists and understand the requirements of machine learning projects
- Basic understanding of NLP concepts and the data needs for training language models, so you can anticipate and accommodate the specific forms of text data and preprocessing they require
- Good communication skills to document data workflows and to coordinate with team members across different functions
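As an illustration of the text-normalization and de-duplication work the requirements above refer to, a minimal stdlib-only sketch might look like this (function names and thresholds are illustrative assumptions, not part of any Kyivstar codebase; production pipelines would typically use dedicated libraries such as spaCy or fasttext):

```python
import hashlib
import unicodedata


def normalize_text(text: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace.

    NFC matters for Ukrainian text, where the same character can arrive
    either precomposed or as a base letter plus a combining mark.
    """
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())


def deduplicate(docs: list[str]) -> list[str]:
    """Drop exact duplicates by hashing the normalized form of each document."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in docs:
        digest = hashlib.sha256(normalize_text(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```

Hashing the normalized text rather than the raw bytes means two documents that differ only in whitespace or Unicode composition still count as duplicates.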
Responsibilities
- Design, develop, and maintain ETL/ELT pipelines for gathering, transforming, and storing large volumes of text data and related information. Ensure pipelines are efficient and can handle data from diverse sources (e.g., web crawls, public datasets, internal databases) while maintaining data integrity
- Implement web scraping and data collection services to automate the ingestion of text and linguistic data from the web and other external sources. This includes writing crawlers or using APIs to continuously collect data relevant to our language modeling efforts
- Implement NLP/LLM-specific data processing: text cleaning and normalization (e.g., filtering toxic content, de-duplication, de-noising) and detection and removal of personal data
- Build task-specific SFT/RLHF datasets from existing data, including data augmentation and labeling with an LLM as a teacher
- Set up and manage cloud-based data infrastructure for the project. Configure and maintain data storage solutions (data lakes, warehouses) and processing frameworks (e.g., distributed compute on AWS/GCP/Azure) that can scale with growing data needs
- Automate data processing workflows and ensure their scalability and reliability. Use workflow orchestration tools like Apache Airflow to schedule and monitor data pipelines, enabling continuous and repeatable model training and evaluation cycles
- Maintain and optimize analytical databases and data access layers for both ad-hoc analysis and model training needs. Work with relational databases (e.g., PostgreSQL) and other storage systems to ensure fast query performance and well-structured data schemas
- Collaborate with Data Scientists and NLP Engineers to build data features and datasets for machine learning models. Provide data subsets, aggregations, or preprocessing as needed for tasks such as language model training, embedding generation, and evaluation
- Implement data quality checks, monitoring, and alerting. Develop scripts or use tools to validate data completeness and correctness (e.g., ensuring no critical data gaps or anomalies in the text corpora), and promptly address any pipeline failures or data issues. Implement data version control
- Manage data security, access, and compliance. Control permissions to datasets and ensure adherence to data privacy policies and security standards, especially when dealing with user data or proprietary text sources
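The data quality checks described in the responsibilities above could, in a minimal form, be sketched as a corpus validation pass (the report fields and the 20-character threshold are illustrative assumptions; real pipelines would likely wire such checks into Airflow tasks with alerting):

```python
from dataclasses import dataclass


@dataclass
class QualityReport:
    """Summary of basic completeness checks over a text corpus."""
    total: int
    empty: int
    too_short: int


def check_corpus(docs: list[str], min_chars: int = 20) -> QualityReport:
    """Flag empty and suspiciously short documents before they reach training."""
    empty = sum(1 for d in docs if not d.strip())
    too_short = sum(1 for d in docs if d.strip() and len(d) < min_chars)
    return QualityReport(total=len(docs), empty=empty, too_short=too_short)
```

A pipeline step would typically fail or raise an alert when the empty or too-short counts exceed an agreed tolerance, rather than silently passing degraded data downstream.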
Preferred Qualifications
- Experience with distributed data processing frameworks (such as Apache Spark or Databricks) for large-scale data transformation, and with message streaming systems (Kafka, Pub/Sub) for real-time data pipelines
- Familiarity with data serialization formats (JSON, Parquet) and handling of large text corpora
- Deep experience in web scraping, using tools like Scrapy, Selenium, or Beautiful Soup, and handling anti-scraping challenges (rotating proxies, rate limiting)
- Ability to parse and clean raw text data from HTML, PDFs, or scanned documents
- Knowledge of setting up CI/CD pipelines for data engineering (using GitHub Actions, Jenkins, or GitLab CI) to test and deploy changes to data workflows
- Experience with containerization (Docker) to package data jobs and with Kubernetes for scaling them is a plus
- Experience with analytics platforms and BI tools (e.g., Tableau, Looker) used to examine the data prepared by the pipelines
- Understanding of how to create and manage data warehouses or data marts for analytical consumption
- Demonstrated ability to work independently in solving complex data engineering problems, optimizing existing pipelines, and implementing new ones under time constraints
- A proactive attitude to explore new data tools or techniques that could improve our workflows
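For the HTML parsing and cleaning mentioned above, a stdlib-only sketch using `html.parser` might look like this (class and function names are illustrative; real scraping work would more likely use Beautiful Soup or Scrapy, as noted):

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text from HTML, skipping script and style content."""

    SKIP = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self._chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that is outside <script>/<style> and non-blank.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document as a single string."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser._chunks)
```

This covers only well-formed pages; scraped data in practice also needs encoding detection, boilerplate removal, and handling of malformed markup.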
Benefits
- Office or remote – it’s up to you. You can work from anywhere, and we will arrange your workplace
- Remote onboarding
- Performance bonuses for everyone (annual or quarterly — depends on the role)
- Employee training: opportunities to learn through the company’s library, internal resources, and programs from partners
- Health and life insurance
- Wellbeing program and corporate psychologist
- Reimbursement of expenses for Kyivstar mobile communication