Remote Support Operations Engineer at CoreWeave

Summary

CoreWeave, a cloud provider specializing in GPU compute resources, is hiring a Support Operations Engineer. The role involves monitoring fleet health, troubleshooting customer support requests, collaborating with other teams, and ensuring optimal performance of the clusters. The position requires knowledge of cloud computing, Linux, Kubernetes, and Docker, along with prior HPC/AI experience. Benefits include a competitive salary, medical insurance, life insurance, a flexible spending account, tuition reimbursement, mental wellness benefits, family-forming support, paid parental leave, flexible PTO, catered lunch, weekly massages (in the NJ office), a casual work environment, and a work culture focused on innovative disruption.

Requirements

A working knowledge of cloud computing, virtualization, and container technologies
A working knowledge of Linux - tell us about your favorite Linux distro
A working knowledge of Kubernetes and Docker
A prior role in Sysadmin, Site Reliability Engineering, DevOps, or Infrastructure Operations
A prior role in HPC/AI
A knack for solving problems - recognizing technical issues, developing appropriate solutions, and following through to completion
A love for creating documentation and processes to better your team’s internal knowledge base
An interest in building the world’s largest bespoke supercomputers for leading AI labs
A solid understanding of distributed computing environments and methodologies, such as storage volumes, private networks, load balancers, and virtual machines
Excellent communication skills (both written and verbal)
A willingness to work in a very fast-paced environment with dynamic priorities and ever-changing developments
The ability to work highly independently while collaborating well as part of a team
Willingness and interest to travel to CoreWeave data centers as needed

Responsibilities

Monitor the fleet’s health, performance, and reliability using our observability stack (Grafana, Prometheus, VictoriaMetrics)
Use CoreWeave Kubernetes to troubleshoot customer support requests and act as a technical escalation point for the Cloud Support Engineers
Learn from fellow Support Operations Engineer teammates and mentor junior engineers and new hires
Leverage knowledge of Linux (Ubuntu) to diagnose, troubleshoot, and rectify bugs across the fabric
Assist and collaborate with other teams involved in the management and operation of CoreWeave infrastructure
Offer expertise, guidance, and troubleshooting support to ensure the smooth functioning and optimal performance of the clusters
Support some of the world’s largest bare-metal fleets of dedicated servers running the latest NVIDIA H100 GPU technology on InfiniBand deployments
Work hand in hand with our Data Center Technicians to install, configure, and troubleshoot all aspects of data center infrastructure
Liaise with Cloud Operations to ensure that the CoreWeave platform is scalable, reliable, and stable
Partner with our network engineers and software developers to collect failure logs, reproduce issues, and ultimately solve the world’s hardest problems
Work with our Technical Writing team to identify, create, and maintain documentation of troubleshooting workflows, corner-case scenarios, and new discoveries
Serve as a technical liaison on incidents and escalations, communicating with all stakeholders
Participate in a 24/7 on-call rotation every few months, ensuring that mission-critical alerts are addressed for infrastructure resiliency
Develop alerting, telemetry, and new metrics to proactively prevent issues across the fleet and reduce the need for reactive support

Preferred Qualifications

Prior experience with computer hardware or server hardware - did you build your own PC at home?
Prior experience in a data center as an engineer or a technician - what kind of servers did you work on?
Prior experience with NVIDIA GPUs and CUDA technologies
Prior experience with Supermicro, Dell, HP Enterprise, and Gigabyte systems
Prior experience with HPC systems
Prior experience with AI / ML

Benefits

Medical, dental and vision insurance - 100% paid for the employee
Company paid Life Insurance
Voluntary supplemental life insurance
Short and long-term disability insurance
Flexible Spending Account
Tuition Reimbursement
Mental Wellness Benefits through Spring Health
Family-forming support, including paid parental leave and flexible, full-service childcare support with Kinside
401(k) with a generous employer match
Flexible PTO
Catered lunch each day in our offices
Weekly massages in NJ office
A casual work environment
Work culture focused on innovative disruption

CoreWeave is hiring a Support Operations Engineer, Remote - United States

Support Operations Engineer closed

