Research Engineer - Data

Leonardo.Ai
Summary
Join Leonardo.Ai, a Canva company, and become a Research Engineer - Data, architecting and managing petascale data pipelines for world-class AI models. You will collaborate with researchers to create and curate large, multi-modal datasets, including synthetic data. Your expertise in distributed systems and data processing will be crucial. Responsibilities include data acquisition and curation, developing high-performance data pipelines, generating synthetic data, conducting experiments, ensuring data security and compliance, and contributing to open-source projects. The role offers a flexible work environment and opportunities for professional growth within a diverse and inclusive culture.
Requirements
- Have hands-on experience with images, videos, 3D geometry (mesh/solid modeling), and/or text data
- Have well-rounded expertise in Python and PyTorch
- Demonstrate proficiency in setting up large-scale, robust data pipelines, using frameworks like Spark, Ray, or Metaflow
- Be comfortable with model versioning and experiment tracking
- Have a good understanding of parallel and distributed computing
- Be experienced with setting up evaluation methods
- Have experience with AWS, Azure, or other cloud platforms
- Be proficient in both relational (MySQL, PostgreSQL) and NoSQL (MongoDB, Cassandra) databases, plus vector data stores
Responsibilities
- Lead the ingestion, unification, and organization of large, unstructured data sources (e.g., text, images, 3D geometry, code snippets) into scalable, high-quality datasets suitable for machine learning research and production
- Develop and optimize distributed systems for data processing, including filtering, indexing, and retrieval, leveraging frameworks like Ray, Metaflow, Spark, or Hadoop
- Build and orchestrate pipelines to generate synthetic data at scale, advancing research on cost-efficient inference and training strategies
- Design and conduct experiments on dataset quality, scalability, and performance
- Collaborate with legal and safety teams to ensure all data usage respects privacy, security, and ethical standards
- Contribute to internal and external libraries or frameworks, sharing insights and breakthroughs with the wider AI community through publications or technical blogs
Preferred Qualifications
- Have a passion for synthetic data generation using inference from pretrained models, 3D rendering engines, and/or other software
Benefits
- Flexible Work Environment: We understand the importance of work-life balance. Thrive personally and professionally with the option to work remotely or in our vibrant offices
- Empowering Growth: We invest in your development with continuous learning opportunities and clear pathways for career advancement tailored to your goals