Understanding and Analyzing NVIDIA GPU Topology in Linux
A comprehensive guide on using nvidia-smi to inspect GPU topology and a deep dive into the meaning of topology identifiers (NODE, SYS, PHB, etc.) to optimize multi-GPU communication.
When training large-scale deep learning models or deploying multi-GPU inference, the communication efficiency between GPUs directly impacts overall performance. Understanding the GPU Topology helps us optimize process binding (CPU Affinity) and data transfer paths, avoiding unnecessary PCIe bottlenecks.
How to View GPU Topology
In a Linux environment with NVIDIA drivers installed, you can use the following command to view the detailed GPU topology matrix:
nvidia-smi topo -m
This command outputs a matrix showing the connection type between every pair of GPUs and their affinity with the CPU.
Topology Matrix Example Analysis
Suppose we see the following output:
| | GPU0 | GPU1 | GPU2 | CPU Affinity | NUMA Affinity | GPU NUMA ID |
|---|---|---|---|---|---|---|
| GPU0 | X | NODE | NODE | 0-31 | 0 | N/A |
| GPU1 | NODE | X | NODE | 0-31 | 0 | N/A |
| GPU2 | NODE | NODE | X | 0-31 | 0 | N/A |
Analysis:
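In this example, the connections between GPU0, GPU1, and GPU2 are all marked as NODE. This means the three cards sit in the same NUMA node (NUMA Affinity 0, CPU cores 0-31) but behind different PCIe Host Bridges, so peer-to-peer traffic crosses PCIe and the interconnect between those bridges. That path is slower than PIX, PXB, or PHB, but it avoids the cross-socket hop that SYS implies.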
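To work with the matrix programmatically, the text output can be split into a per-GPU dictionary. The following is a minimal sketch, assuming the whitespace-separated layout shown in the example above; `parse_topo_matrix` and `SAMPLE_OUTPUT` are hypothetical names, and the exact column layout can differ between driver versions or when NIC columns are present.

```python
import re

# Sample text shaped like the example matrix above (not captured from a real machine).
SAMPLE_OUTPUT = """\
        GPU0    GPU1    GPU2    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NODE    NODE    0-31            0               N/A
GPU1    NODE     X      NODE    0-31            0               N/A
GPU2    NODE    NODE     X      0-31            0               N/A

Legend:

  X    = Self
"""

ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*m")  # drop terminal styling codes if any
GPU_NAME = re.compile(r"^GPU\d+$")


def parse_topo_matrix(text):
    """Return (links, affinity) dictionaries parsed from topology-matrix text."""
    links, affinity, devices = {}, {}, None
    for raw in text.splitlines():
        tokens = ANSI_ESCAPE.sub("", raw).split()
        if not tokens:
            continue
        if tokens[0].startswith("Legend"):
            break  # the matrix ends where the legend begins
        if devices is None:
            # Header row: device names appear before the "CPU Affinity" column.
            if "CPU" in tokens:
                devices = tokens[:tokens.index("CPU")]
            continue
        if not GPU_NAME.match(tokens[0]):
            continue  # skip non-GPU rows (assumption: e.g. NIC rows)
        name, cells = tokens[0], tokens[1:]
        links[name] = dict(zip(devices, cells[:len(devices)]))
        rest = cells[len(devices):]
        affinity[name] = {
            "cpus": rest[0] if len(rest) > 0 else None,
            "numa": rest[1] if len(rest) > 1 else None,
        }
    return links, affinity


links, affinity = parse_topo_matrix(SAMPLE_OUTPUT)
print(links["GPU0"]["GPU1"], affinity["GPU0"])
```

On a real machine, the same function could be fed the standard output of `nvidia-smi topo -m`, for example captured with `subprocess.run(["nvidia-smi", "topo", "-m"], capture_output=True, text=True)`.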
Deep Dive into Topology Identifiers (Legend)
Understanding these abbreviations is key to performance tuning. Here is the detailed explanation:
| Identifier | Meaning | Detailed Explanation |
|---|---|---|
| X | Self | The GPU itself. |
| NV# | NVLink | Highest Speed. Connection via a bonded set of NVLinks, bypassing the PCIe protocol for maximum bandwidth and minimum latency. |
| PIX | PCIe Single Bridge | High Speed. Connection traverses at most a single PCIe bridge; a very short path. |
| PXB | PCIe Multiple Bridges | Medium Speed. Connection traverses multiple PCIe bridges but does not reach the PCIe Host Bridge. |
| PHB | PCIe Host Bridge | Medium Speed. Connection must traverse a PCIe Host Bridge (typically the CPU’s internal PCIe controller). |
| NODE | NUMA Node | Low Speed. Connection traverses PCIe as well as the interconnect between PCIe Host Bridges within a single NUMA node. |
| SYS | System | Slowest. Connection traverses PCIe and the SMP interconnect (e.g., Intel UPI or AMD Infinity Fabric) between different NUMA nodes. |
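The ordering in the legend can be turned into a simple ranking to find the weakest connection within a group of GPUs. This is a sketch under stated assumptions: `link_cost`, `slowest_link`, and `EXAMPLE_LINKS` are hypothetical names, the 4-GPU matrix is invented for illustration, and all NV# values are treated as equally fast even though more bonded links generally mean more bandwidth.

```python
import re
from itertools import combinations

# PCIe-based identifiers from shortest to longest path, following the legend above.
PCIE_ORDER = ["PIX", "PXB", "PHB", "NODE", "SYS"]


def link_cost(label):
    """Map an identifier to a rank; a lower rank means a shorter path."""
    if re.fullmatch(r"NV\d+", label):
        return 0  # simplification: every NVLink count ranks the same
    return 1 + PCIE_ORDER.index(label)  # raises ValueError for unknown labels


def slowest_link(links, gpus):
    """Return (label, pair) for the worst pairwise connection among gpus."""
    worst = max(combinations(gpus, 2),
                key=lambda pair: link_cost(links[pair[0]][pair[1]]))
    return links[worst[0]][worst[1]], worst


# Hypothetical matrix: two NVLink pairs that sit on different CPU sockets.
EXAMPLE_LINKS = {
    "GPU0": {"GPU0": "X", "GPU1": "NV4", "GPU2": "SYS", "GPU3": "SYS"},
    "GPU1": {"GPU0": "NV4", "GPU1": "X", "GPU2": "SYS", "GPU3": "SYS"},
    "GPU2": {"GPU0": "SYS", "GPU1": "SYS", "GPU2": "X", "GPU3": "NV4"},
    "GPU3": {"GPU0": "SYS", "GPU1": "SYS", "GPU2": "NV4", "GPU3": "X"},
}

print(slowest_link(EXAMPLE_LINKS, ["GPU0", "GPU1"]))
print(slowest_link(EXAMPLE_LINKS, ["GPU0", "GPU1", "GPU2", "GPU3"]))
```

A check like this can help decide which GPUs to group into one job: in the hypothetical matrix, keeping a job on GPU0 and GPU1 stays on NVLink, while spanning all four GPUs pulls in a SYS link.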
Performance Optimization Tips
- Prioritize NVLink: If the topology shows NV#, it is the fastest path and should be the first choice for distributed training.
- Avoid SYS Cross-Node Communication: If you see SYS, data must travel across physical CPU sockets, which introduces the highest latency. Try to bind processes to CPU cores on the same NUMA node as the GPU.
- CPU Affinity Binding: Use tools like numactl to bind processes to the corresponding NUMA Affinity to significantly reduce memory access latency.
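The binding tip above can be sketched as a small helper that builds a numactl invocation from a GPU's NUMA Affinity value. This is an illustration only: `numactl_command` is a hypothetical helper, `train.py` is a placeholder workload, and restricting the process to one GPU through `CUDA_VISIBLE_DEVICES` is an assumption about how the job is launched.

```python
import shlex


def numactl_command(numa_node, gpu_index, workload):
    """Build (env, cmd) that pins a workload to one NUMA node and one GPU."""
    # Expose only the chosen GPU to the process.
    env = {"CUDA_VISIBLE_DEVICES": str(gpu_index)}
    cmd = [
        "numactl",
        f"--cpunodebind={numa_node}",  # run only on CPUs of this NUMA node
        f"--membind={numa_node}",      # allocate memory only from this node
        *workload,
    ]
    return env, cmd


# GPU0 in the example matrix reports NUMA Affinity 0.
env, cmd = numactl_command(numa_node=0, gpu_index=0, workload=["python", "train.py"])
print(env, shlex.join(cmd))
```

To launch the process, the pair could be passed to `subprocess.run(cmd, env={**os.environ, **env})`. Building the command separately makes it easy to log or inspect before anything is started.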
Summary: By mastering the hardware layout via nvidia-smi topo -m, you can make the most rational decisions regarding process distribution in multi-GPU environments, fully unleashing your hardware’s potential.