Remote SW Engineer

Rivos Inc.

πŸ“Remote - United States, United Kingdom

Job highlights

Summary

Join our team to improve the Deep Learning ecosystem by designing and implementing highly optimized communication collectives libraries. You will work closely with hardware and software teams to ensure efficient data communication and synchronization across multiple AI accelerators in a distributed system.

Requirements

  • Strong understanding of GPU architectures and platforms (CUDA, AMD ROCm) and experience in GPU programming (CUDA, HIP, or similar)
  • Proficiency in designing and implementing parallel and distributed algorithms, particularly communication collectives
  • Experience with network interconnects (NVLink, PCIe, InfiniBand, RDMA) and understanding of their performance implications
  • Hands-on experience with communication collectives libraries like UCC, NCCL, or MPI
  • Strong knowledge of concurrency, synchronization, and memory consistency models in multi-threaded and distributed environments
  • Experience with profiling and optimizing low-level performance (memory bandwidth, latency, throughput) on GPU architectures
  • Familiarity with deep learning frameworks (TensorFlow, PyTorch, etc.) and their use of communication collectives
  • Strong problem-solving skills and ability to work in a fast-paced, collaborative environment

Responsibilities

  • Build up the communication components of an AI software stack
  • Port AI software to run on a new hardware platform
  • Profile and tune communications within AI applications
  • Design, develop, and optimize communication collectives (e.g., AllReduce, AllGather, Broadcast, ReduceScatter) for large-scale distributed computing and machine learning frameworks
  • Implement and optimize communication algorithms (ring, tree, butterfly, etc.) tailored for our architectures and multi-node clusters
  • Ensure low-latency, high-bandwidth communication across multi-GPU setups, supporting interconnects such as PCIe and InfiniBand
  • Collaborate with hardware engineers and other software teams to optimize performance
  • Implement fault tolerance and scalability mechanisms in distributed systems to handle large-scale workloads
  • Write unit tests and benchmark tools to validate the performance and correctness of collective operations
  • Stay current with advancements in hardware and networking technologies to continuously improve the library's performance

Preferred Qualifications

  • Network driver experience recommended
  • Excellent problem-solving skills and strong written and verbal communication
  • Strong organizational skills and a high degree of self-motivation
  • Ability to work well in a team and be productive under aggressive schedules
