Member Of Technical Staff - Infrastructure & Data

Moonvalley

📍 Remote - Worldwide

Summary

Join Moonvalley, a cutting-edge creative studio powered by advanced AI models, as an Infrastructure Engineer. You will play a central role in architecting and scaling our AI systems: managing and scaling GPU infrastructure with Kubernetes and Terraform/Pulumi, maintaining ETL pipelines with Spark/Ray/Airflow, and overseeing the telemetry platform with Datadog, Grafana, and W&B. This early-stage opportunity lets you work with top AI talent, tackle challenging data problems, and contribute to the future of AI. The role requires a passion for building petabyte-scale systems and the ability to balance urgent fixes with long-term solutions. Moonvalley offers fully remote or hybrid positions with occasional company meetings.

Requirements

  • Passion for building petabyte-scale systems that enhance efficiency and productivity
  • Ability to balance quick fixes for urgent needs with long-term, scalable solutions
  • Strong prioritization skills in a fast-moving, high-impact environment
  • Comfortable using open-source tools or developing custom solutions when needed
  • A versatile generalist, eager to learn and adapt to new tools and systems

Responsibilities

  • Manage and scale GPU infrastructure (Kubernetes, Terraform / Pulumi)
  • Maintain ETL pipelines (Spark / Ray / Airflow)
  • Oversee the telemetry platform to monitor system health (Datadog, Grafana, W&B)
  • Manage the code platform (GitHub, CI/CD, PyTorch, Python)
  • Track and optimize assets like datasets, checkpoints, and compute resources
  • Develop tools, documentation, and guidance for the team
  • Windows client and server administration
  • Build robust, high-performance distributed training for large-scale transformer models across clusters of 1,000-5,000 GPUs
  • Implement high-performance, multi-modal data pipelines capable of processing petabyte-scale datasets within hours
  • Continuously evolve our infrastructure to stay ahead of cutting-edge AI advancements
  • Scale our infrastructure to handle the next order of magnitude in growth

Preferred Qualifications

  • Experience with infrastructure for large-scale AI training
  • Cluster Engineering: GPU infrastructure, Kubernetes expertise
  • Data Engineering: Mastery of ETL pipelines
  • Developer Advocacy: Improving workflows, documentation, and tool adoption

Benefits

All roles at Moonvalley are fully remote by default, or hybrid where specified.
