Senior Hardware Reliability Engineer
closed
CoreWeave
$160k-$220k
Remote - United States
Summary
Join CoreWeave, a leading AI hyperscaler, as a highly skilled GPU and PCIe troubleshooting engineer. You will be a crucial part of the Hardware Engineering team, contributing to the design, development, troubleshooting, and optimization of server hardware infrastructure, and collaborating with cross-functional teams and vendors to deliver high-performance hardware solutions. The role requires expertise in GPU and PCIe technologies, automation, and server hardware management. CoreWeave offers a competitive salary, comprehensive benefits, and a hybrid work environment with flexibility for remote work. The company is committed to fostering an inclusive and supportive workplace.
Requirements
- Prior experience supporting and troubleshooting data center class GPUs (preferably A100 or newer)
- Proficiency in Ansible/Python and experience programmatically interacting with server BMCs using IPMI or Redfish (preferably Redfish)
- Experience using, integrating and automating data center class GPU diagnostics and troubleshooting tools
- In-depth knowledge of server hardware, components, and management technologies, particularly GPUs and PCIe devices
- Proven ability to stay updated with the latest industry technologies and trends
- Previous experience collaborating with hardware vendors
- Strong passion for automation, with a commitment to automating processes end to end
- Excellent documentation skills and attention to detail
- Strong analytical and problem-solving abilities
- Applicants must have work authorization that does not require sponsorship from the company now or in the future
Responsibilities
- Troubleshoot complex GPU and PCIe related failures
- Partner with external vendors on failure analysis
- Track component RMAs
- Develop and maintain hardware/firmware management services
- Automate all aspects of the server hardware lifecycle
- Serve as the senior point of contact for hardware escalation and troubleshooting
- Collaborate with cross-functional teams to define hardware requirements, specifications, and system architecture
- Create and maintain accurate documentation of hardware designs, specifications, test procedures, and results
- Analyze and optimize the performance of hardware systems, identify bottlenecks, and propose improvements for enhanced efficiency
- Establish processes for internal hardware testing, deployment, and performance optimization
Benefits
- Medical, dental, and vision insurance - 100% paid for by CoreWeave
- Company-paid Life Insurance
- Voluntary supplemental life insurance
- Short and long-term disability insurance
- Flexible Spending Account
- Health Savings Account
- Tuition Reimbursement
- Mental Wellness Benefits through Spring Health
- Family-Forming support provided by Carrot
- Paid Parental Leave
- Flexible, full-service childcare support with Kinside
- 401(k) with a generous employer match
- Flexible PTO
- Catered lunch each day in our office and data center locations
- A casual work environment
- A work culture focused on innovative disruption
- Hybrid work environment with flexibility for remote work
This job is filled or no longer available