Lead Cluster Operations Support Engineer

Thoughtworks
Summary
Join a team providing 24/7 support for clients using large GPU clusters (6,000+ contiguous GPUs) for managed post-training (MPT). This involves assisting with preparation, offering round-the-clock support during training, and ensuring optimal cluster utilization. The team operates across three time zones with established hand-off protocols. While expertise in infrastructure and cluster operations is crucial, a foundational understanding of machine learning is also necessary. You will contribute to accelerator development, assess model training readiness, and provide support, including rotating weekend shifts. The role demands collaboration with machine learning and infrastructure engineers, proactive problem-solving, and a high level of professionalism.
Requirements
- Deep expertise in Kubernetes administration and debugging at scale
- Deep knowledge of managing large Kubernetes clusters with thousands of nodes
- Knowledge of running training workloads on thousands of GPUs
- Experience with the underlying cloud platforms: GCP, AWS, Azure
- Linux, Helm charts, and infrastructure-as-code tools such as Terraform or Pulumi
- You will be part of a high-value, client-facing, white-glove service, where a high level of professionalism is required
- You understand the importance of stakeholder management and can easily liaise between clients and other key stakeholders throughout projects, ensuring buy-in and gaining trust along the way
- You are resilient in ambiguous situations and can adapt your role to approach challenges from multiple perspectives
- You don't shy away from risks or conflicts; instead, you take them on and skillfully manage them
- You are eager to coach, mentor and motivate others and you aspire to influence teammates to take positive action and accountability for their work
- You enjoy influencing others and always advocate for technical excellence while being open to change when needed
- You have an insatiable curiosity and a drive to learn new things
Responsibilities
- Help shape and iterate on this new white-glove model training support service on large GPU clusters
- Work in a collaborative team with Machine Learning Engineers and Infrastructure Engineers
- Contribute to accelerator development: find gaps in the tooling, missing automation, or recurring patterns for which we could develop accelerators to make the next round faster and more efficient. For example, we may need to improve observability, automate user onboarding, or bring in a new tool that everyone wants to use. This will probably involve a combination of Terraform/Pulumi, Helm charts, Python, and shell scripts
- Help assess the model training readiness and data preparation
- Provide model training support, including rotating daytime weekend shifts with pagers, for any issues clients may encounter. These can range from infrastructure issues to data science issues or anything in between, e.g. GCP changing a GKE configuration in a way that affects training
- Facilitate collaborative problem solving within the team by actively listening, communicating effectively and mentoring other engineers
- Proactively identify and address challenges related to the white-glove service for continued pre-training, proposing solutions and implementing improvements
Preferred Qualifications
- Knowledge of working with the Lustre filesystem is a plus
- Knowledge of working with NVIDIA NeMo Framework (Docker image for model training)
- Knowledge of working with NVIDIA NeMo NIMs (Docker images for inference)
- Nice to have: Run:ai, TrueFoundry, the Hugging Face platform, etc. (training can be provided)
- Knowledge of working with HPC technologies such as Slurm is a bonus
Benefits
Learning & Development: There is no one-size-fits-all career path at Thoughtworks: how you develop your career is entirely up to you. We balance that autonomy with the strength of our cultivation culture. This means your career is supported by interactive tools, numerous development programs, and teammates who want to help you grow. We see value in helping each other be our best, and that extends to empowering our employees in their career journeys.