Oleg Zabluda's blog
Saturday, March 25, 2017
 
Fast Multi-GPU collectives with NCCL
"""
NCCL (pronounced “Nickel”) is a library of multi-GPU collective communication primitives that are topology-aware [...] NCCL can be deployed in single-process or multi-process applications, handling required inter-process communication transparently. Finally, the API will be very familiar to anyone with experience using MPI’s collectives.
[...]
For example, consider a broadcast of data from GPU0 to all other GPUs in the PCIe tree topology pictured below. A two-step tree algorithm is a common choice in this situation. In the first step the data is sent from the GPU0 to a second GPU, and in the second step both of these send data to the remaining processors. [...] To optimize Broadcast bandwidth, an even better approach is to treat the PCIe topology above as a ring. The broadcast is then performed by relaying small chunks of the input around the ring from GPU0 to GPU3. Interestingly, ring algorithms provide near optimal bandwidth for nearly all of the standard collective operations, even when applied to “tree-like” PCIe topologies.
"""
https://devblogs.nvidia.com/parallelforall/fast-multi-gpu-collectives-nccl/
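
The chunked ring relay described in the quote can be illustrated with a small simulation. This is not NCCL code, just a hypothetical Python sketch of the idea: GPU0's buffer is split into chunks that are relayed around the ring GPU0 → GPU1 → GPU2 → GPU3, so that after a short pipeline fill, every link is busy forwarding a chunk on every step.

```python
# Hypothetical sketch (not the NCCL implementation) of a chunked ring
# broadcast: chunks flow from GPU 0 around the ring, one hop per step.

def ring_broadcast(num_gpus, chunks):
    """Relay `chunks` from GPU 0 around a ring of `num_gpus` devices.

    Returns (per-GPU buffers, number of pipelined steps taken)."""
    buffers = [list(chunks)] + [[] for _ in range(num_gpus - 1)]
    steps = 0
    # Run until the last GPU in the ring holds every chunk.
    while len(buffers[-1]) < len(chunks):
        # One step: each GPU forwards the next chunk its downstream
        # neighbor is missing. Iterating downstream-first ensures a chunk
        # moves at most one hop per step.
        for g in range(num_gpus - 2, -1, -1):
            nxt = len(buffers[g + 1])  # next chunk index the neighbor needs
            if nxt < len(buffers[g]):
                buffers[g + 1].append(buffers[g][nxt])
        steps += 1
    return buffers, steps

buffers, steps = ring_broadcast(4, list(range(8)))
# With c chunks and k GPUs, the pipeline completes in c + k - 2 steps
# (here 8 + 4 - 2 = 10), so for many small chunks the cost approaches
# one chunk-time per chunk of data -- near-optimal link bandwidth --
# whereas the two-step tree sends the full buffer twice in sequence.
```

This is why the quote notes that ring algorithms achieve near-optimal bandwidth even on tree-like PCIe topologies: the pipeline startup cost (k − 2 extra steps) is amortized away as the chunk count grows.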
