Remote HPC Network Engineer
CoreWeave
Job highlights
Summary
CoreWeave is seeking a highly skilled HPC Network Engineer to join its fast-growing team. The role involves monitoring, troubleshooting, supporting, deploying, and configuring InfiniBand fabrics. The ideal candidate is proficient in InfiniBand configuration and management, network architectures and topologies, Linux system administration, and at least one scripting language. Preferred skills include experience with Nvidia UFM, the SLURM job scheduler, Grafana, HPC systems architecture, various MPI implementations, automation and configuration management tools such as Ansible, and open-source technologies pertinent to HPC administration. Compensation ranges from $160,000 to $210,000. The role requires attending onboarding training in New Jersey, followed by quarterly travel of one week each.
Requirements
- Proficient in InfiniBand configuration and management
- Solid understanding of network architectures, topologies, best practices, and techniques for high performance and availability
- Familiarity with optical networking hardware
- Experience in Linux system administration
- Proficiency in at least one scripting language
- Team player with effective collaboration skills
- Ability to manage multiple tasks and projects concurrently
Responsibilities
- Monitoring the performance and overall health of InfiniBand fabrics
- Troubleshooting various issues that may arise within InfiniBand fabrics
- Assisting and collaborating with other teams that manage and operate HPC clusters using InfiniBand technology
- Helping install large fabrics, organizing and working with teams, onsite personnel, and customers to bring fabrics up from day 0 to operational status
- Working with configuration tooling and operations teams to carry out maintenance and upgrades of the switches and the control plane of the fabrics
Preferred Qualifications
- Hands-on experience with Nvidia UFM
- Familiarity with the SLURM job scheduler
- Experience or familiarity with Grafana for monitoring and visualization
- Insight into HPC systems architecture and operational workflows
- Familiarity with various MPI implementations
- Experience with automation and configuration management tools such as Ansible
- Acquaintance with open-source technologies pertinent to HPC administration, including resource management, storage systems, monitoring infrastructure, software deployment, and continuous integration
Benefits
- Medical, dental and vision insurance - 100% paid for the employee
- Company-paid life insurance
- Voluntary supplemental life insurance
- Short and long-term disability insurance
- Flexible Spending Account
- Tuition Reimbursement
- Mental Wellness Benefits through Spring Health
- Family-Forming support provided by Carrot
- Paid Parental Leave
- Flexible, full-service childcare support with Kinside
- 401(k) with a generous employer match
- Flexible PTO
- Catered lunch each day in our offices
- Weekly massages in NJ office
- A casual work environment
- Work culture focused on innovative disruption