Member of Technical Staff - Infrastructure & Data
![Moonvalley Logo](https://cdn.jobscollider.com/logo/moonvalley.com-2f51-1.webp)
Moonvalley
Summary
Join Moonvalley, a cutting-edge creative studio powered by advanced AI models, as an Infrastructure Engineer. You will play a crucial role in architecting and scaling our AI systems: managing and scaling GPU infrastructure with Kubernetes and Terraform/Pulumi, maintaining ETL pipelines with Spark/Ray/Airflow, and overseeing the telemetry platform with Datadog, Grafana, and W&B. This early-stage opportunity allows you to work with top AI talent, tackle challenging data problems, and contribute to the future of AI. The role requires a passion for building petabyte-scale systems and the ability to balance urgent fixes with long-term solutions. Moonvalley offers fully remote or hybrid positions with occasional company meetings.
Requirements
- Passion for building petabyte-scale systems that enhance efficiency and productivity
- Ability to balance quick fixes for urgent needs with long-term, scalable solutions
- Strong prioritization skills in a fast-moving, high-impact environment
- Comfortable using open-source tools or developing custom solutions when needed
- A versatile generalist, eager to learn and adapt to new tools and systems
Responsibilities
- Manage and scale GPU infrastructure (Kubernetes, Terraform / Pulumi)
- Maintain ETL pipelines (Spark / Ray / Airflow)
- Oversee the telemetry platform to monitor system health (Datadog, Grafana, W&B)
- Manage the code platform (GitHub, CI/CD, PyTorch, Python)
- Track and optimize assets like datasets, checkpoints, and compute resources
- Develop tools, documentation, and guidance for the team
- Windows client and server administration
- Build robust, high-performance distributed training for large-scale transformer models across clusters of 1,000 to 5,000 GPUs
- Implement high-performance, multi-modal data pipelines capable of processing petabyte-scale datasets within hours
- Continuously evolve our infrastructure to stay ahead of cutting-edge AI advancements
- Scale our infrastructure to handle the next order of magnitude in growth
Preferred Qualifications
- Experience with infrastructure for large-scale AI training
- Cluster Engineering: GPU infrastructure, Kubernetes expertise
- Data Engineering: Mastery of ETL pipelines
- Developer Advocacy: Improving workflows, documentation, and tool adoption
Benefits
All roles at Moonvalley are fully remote by default, or hybrid where specified.