Senior Hardware Reliability Engineer

CoreWeave

💵 $160k-$220k
📍 Remote - United States

Summary

Join CoreWeave, a leading AI hyperscaler, as a highly skilled GPU and PCIe troubleshooting engineer. You will be a crucial part of the Hardware Engineering team, contributing to the design, development, troubleshooting, and optimization of server hardware infrastructure, and collaborating with cross-functional teams and vendors to deliver high-performance hardware solutions. The role requires expertise in GPU and PCIe technologies, automation, and server hardware management. CoreWeave offers a competitive salary, comprehensive benefits, and a hybrid work environment with flexibility for remote work. The company is committed to fostering an inclusive and supportive workplace.

Requirements

  • Prior experience supporting and troubleshooting data center class GPUs (preferably A100 or newer)
  • Proficiency in Ansible/Python and experience programmatically interacting with server BMCs using IPMI or Redfish (preferably Redfish)
  • Experience using, integrating and automating data center class GPU diagnostics and troubleshooting tools
  • In-depth knowledge of server hardware, components, and management technologies, particularly GPUs and PCIe devices
  • Proven ability to stay updated with the latest industry technologies and trends
  • Previous experience collaborating with hardware vendors
  • Strong passion for automation, with a commitment to automating processes comprehensively
  • Excellent documentation skills and attention to detail
  • Strong analytical and problem-solving abilities
  • Applicants must have work authorization that does not require sponsorship from the company now or in the future

Responsibilities

  • Troubleshoot complex GPU- and PCIe-related failures
  • Partner with external vendors on failure analysis
  • Track component RMAs
  • Develop and maintain hardware/firmware management services
  • Automate all aspects of the server hardware lifecycle
  • Serve as the senior point of contact for hardware escalation and troubleshooting
  • Collaborate with cross-functional teams to define hardware requirements, specifications, and system architecture
  • Create and maintain accurate documentation of hardware designs, specifications, test procedures, and results
  • Analyze and optimize the performance of hardware systems, identify bottlenecks, and propose improvements for enhanced efficiency
  • Establish processes for internal hardware testing, deployment, and performance optimization

Benefits

  • Medical, dental, and vision insurance - 100% paid for by CoreWeave
  • Company-paid Life Insurance
  • Voluntary supplemental life insurance
  • Short and long-term disability insurance
  • Flexible Spending Account
  • Health Savings Account
  • Tuition Reimbursement
  • Mental Wellness Benefits through Spring Health
  • Family-Forming support provided by Carrot
  • Paid Parental Leave
  • Flexible, full-service childcare support with Kinside
  • 401(k) with a generous employer match
  • Flexible PTO
  • Catered lunch each day in our office and data center locations
  • A casual work environment
  • A work culture focused on innovative disruption
  • Hybrid work environment with flexibility for remote work

This job is filled or no longer available.