Operations Engineer
CoreWeave
Summary
Join CoreWeave's HPC Networking Team as a detail-oriented Operations Engineer. You will support the deployment, monitoring, troubleshooting, and maintenance of large-scale InfiniBand fabrics. This role requires at least one year of InfiniBand or similar networking technology experience, a solid understanding of networking concepts, Linux system administration experience, and proficiency in at least one scripting language. Preferred qualifications include experience with Nvidia UFM, SLURM, Grafana or Prometheus, Ansible, and data center operations. CoreWeave offers a competitive salary, comprehensive benefits including 100% employer-paid medical, dental, and vision insurance, paid parental leave, flexible PTO, and a hybrid work environment.
Requirements
- At least 1 year of experience with InfiniBand or similar networking technologies
- Solid understanding of networking concepts, including architectures, topologies, operational best practices, and troubleshooting
- Experience with Linux system administration and maintenance
- Proficiency in at least one scripting language
Responsibilities
- Regularly monitor the performance and health of InfiniBand fabrics, including switches, host adapters, and nodes
- Investigate and resolve operational issues within InfiniBand fabrics, such as network connectivity problems and performance bottlenecks
- Assist with the installation and operational bring-up of large InfiniBand fabrics in collaboration with onsite personnel and customer teams
- Perform routine maintenance and upgrades on InfiniBand switches and control plane components
- Collaborate with HPC cluster operations teams to provide troubleshooting and operational expertise
Preferred Qualifications
- Hands-on experience with Nvidia UFM or similar fabric management tools
- Familiarity with the SLURM job scheduler and its role in HPC environments
- Experience with monitoring and visualization platforms such as Grafana or Prometheus
- Experience with operational tooling and automation frameworks like Ansible
- Knowledge of data center operations, including server racks and cabling
- Scripting experience in Python or Bash
Benefits
- Medical, dental, and vision insurance - 100% paid for by CoreWeave
- Company-paid Life Insurance
- Voluntary supplemental life insurance
- Short- and long-term disability insurance
- Flexible Spending Account
- Health Savings Account
- Tuition Reimbursement
- Mental Wellness Benefits through Spring Health
- Family-Forming support provided by Carrot
- Paid Parental Leave
- Flexible, full-service childcare support with Kinside
- 401(k) with a generous employer match
- Flexible PTO
- Catered lunch each day in our office and data center locations
- A casual work environment
- A work culture focused on innovative disruption
- Hybrid work environment
- Remote work options for candidates located more than 30 miles from an office