2026-05-19 Posts

Understanding and Analyzing NVIDIA GPU Topology in Linux

A comprehensive guide on using nvidia-smi to inspect GPU topology and a deep dive into the meaning of topology identifiers (NODE, SYS, PHB, etc.) to optimize multi-GPU communication.

When training large-scale deep learning models or deploying multi-GPU inference, GPU-to-GPU communication efficiency directly affects overall performance. Understanding the GPU topology helps you choose process bindings (CPU affinity) and data transfer paths that avoid unnecessary PCIe bottlenecks.

🚀 How to View GPU Topology?

In a Linux environment with NVIDIA drivers installed, you can use the following command to view the detailed GPU topology matrix:

nvidia-smi topo -m

This command outputs a matrix showing the connection type between every pair of GPUs and their affinity with the CPU.
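
If you want to capture the matrix from a script instead of a terminal, a minimal Python sketch could look like the following. The function name `read_topology_matrix` is hypothetical, and the sketch assumes only that `nvidia-smi` is on the `PATH` when the driver is installed; it returns `None` otherwise rather than raising.

```python
import shutil
import subprocess
from typing import Optional


def read_topology_matrix() -> Optional[str]:
    """Return the raw text printed by `nvidia-smi topo -m`, or None if unavailable."""
    # Hypothetical helper: bail out quietly when the NVIDIA tools are not installed.
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "topo", "-m"],
        capture_output=True,
        text=True,
        check=False,
    )
    # A non-zero exit code usually means the driver could not be reached.
    return result.stdout if result.returncode == 0 else None


if __name__ == "__main__":
    text = read_topology_matrix()
    print(text if text is not None else "nvidia-smi topo -m is not available here")
```

Returning the raw text keeps this step separate from any parsing, so the same output can be logged or saved alongside a training run.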

📊 Topology Matrix Example Analysis

Suppose we see the following output:

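| | GPU0 | GPU1 | GPU2 | CPU Affinity | NUMA Affinity | GPU NUMA ID |
| --- | --- | --- | --- | --- | --- | --- |
| GPU0 | X | NODE | NODE | 0-31 | 0 | N/A |
| GPU1 | NODE | X | NODE | 0-31 | 0 | N/A |
| GPU2 | NODE | NODE | X | 0-31 | 0 | N/A |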

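Analysis: In this example, every pair among GPU0, GPU1, and GPU2 is marked NODE. All three cards sit in the same NUMA node (NUMA Affinity 0, CPU cores 0-31), but NODE indicates that traffic between them crosses the interconnect between PCIe Host Bridges inside that node, so the cards hang off different host bridges. This avoids the slower cross-node SYS path, although it is a longer path than PIX, PXB, or PHB.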
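
To work with such a matrix programmatically, the GPU-to-GPU part can be parsed into a nested dictionary. The sketch below is one possible approach, not an official parser: `parse_gpu_links` and `SAMPLE` are hypothetical names, `SAMPLE` mirrors the example above as whitespace-separated terminal output, and the code assumes plain text without terminal escape codes. It keeps only the `GPU<n>` columns and ignores the affinity columns and any legend lines.

```python
import re
from typing import Dict

# Hypothetical sample mirroring the example matrix, laid out as terminal text.
SAMPLE = """\
        GPU0    GPU1    GPU2    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NODE    NODE    0-31            0               N/A
GPU1    NODE     X      NODE    0-31            0               N/A
GPU2    NODE    NODE     X      0-31            0               N/A
"""

GPU_NAME = re.compile(r"^GPU\d+$")


def parse_gpu_links(text: str) -> Dict[str, Dict[str, str]]:
    """Map each GPU to its link type with every other GPU."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return {}
    # Header: keep tokens such as GPU0, GPU1; "GPU" in "GPU NUMA ID" is skipped.
    gpus = [tok for tok in lines[0].split() if GPU_NAME.match(tok)]
    links: Dict[str, Dict[str, str]] = {}
    for line in lines[1:]:
        tokens = line.split()
        if not GPU_NAME.match(tokens[0]):
            continue  # not a GPU row (for example, a legend line)
        cells = tokens[1:1 + len(gpus)]
        if len(cells) == len(gpus):
            links[tokens[0]] = dict(zip(gpus, cells))
    return links


if __name__ == "__main__":
    print(parse_gpu_links(SAMPLE)["GPU0"])
```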

📖 Deep Dive into Topology Identifiers (Legend)

Understanding these abbreviations is key to performance tuning. Here is the detailed explanation:

| Identifier | Meaning | Detailed Explanation |
| --- | --- | --- |
| X | Self | The GPU itself. |
| NV# | NVLink | Highest Speed. Connection via a bonded set of # NVLinks, bypassing the PCIe protocol for maximum bandwidth and minimum latency. |
| PIX | PCIe Single Bridge | High Speed. Connection traverses at most a single PCIe bridge; a very short path. |
| PXB | PCIe Multiple Bridges | Medium Speed. Connection traverses multiple PCIe bridges but does not reach the PCIe Host Bridge. |
| PHB | PCIe Host Bridge | Medium Speed. Connection must traverse a PCIe Host Bridge (typically the CPU's internal PCIe controller). |
| NODE | NUMA Node | Medium/Low Speed. Connection traverses PCIe as well as the interconnect between PCIe Host Bridges within a single NUMA node. |
| SYS | System | Slowest. Connection traverses PCIe and the SMP interconnect (e.g., Intel UPI or AMD Infinity Fabric) between different NUMA nodes. |
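
The legend implies a rough ordering from closest to farthest path, which can be expressed as a sort key. The sketch below is illustrative only: the names `link_sort_key`, `rank_pairs`, and `EXAMPLE` are hypothetical, the four-GPU data is invented for the demonstration, and the ordering reflects path length in the legend rather than measured bandwidth.

```python
from typing import Dict, List, Tuple

# PCIe-based link types, ordered from shortest to longest path per the legend.
PCIE_ORDER = ["PIX", "PXB", "PHB", "NODE", "SYS"]

Pair = Tuple[str, str]


def link_sort_key(label: str) -> Tuple[int, int]:
    """Smaller keys mean a closer connection; more bonded NVLinks sort first."""
    if label.startswith("NV"):
        count = int(label[2:]) if label[2:].isdigit() else 0
        return (0, -count)
    # Raises ValueError for labels outside the legend (including "X").
    return (1 + PCIE_ORDER.index(label), 0)


def rank_pairs(pair_links: Dict[Pair, str]) -> List[Tuple[Pair, str]]:
    """Return GPU pairs sorted from closest to farthest connection."""
    return sorted(pair_links.items(), key=lambda item: link_sort_key(item[1]))


# Invented example data, not taken from a real machine.
EXAMPLE: Dict[Pair, str] = {
    ("GPU0", "GPU2"): "SYS",
    ("GPU2", "GPU3"): "PIX",
    ("GPU0", "GPU1"): "NV4",
    ("GPU1", "GPU3"): "SYS",
}

if __name__ == "__main__":
    for pair, label in rank_pairs(EXAMPLE):
        print(pair, label)
```

A ranking like this is a starting point for deciding which GPUs to group together; actual throughput should still be confirmed with a bandwidth test on the target machine.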

💡 Performance Optimization Tips

  1. Prioritize NVLink: If the topology shows NV# between two GPUs, that pair has the fastest path and should be the first choice for distributed training.
  2. Avoid SYS Cross-Node Communication: If you see SYS, data must cross the SMP interconnect between NUMA nodes (usually between physical CPU sockets), which is the slowest path in the matrix. Where possible, group GPUs that communicate heavily so that they do not share a SYS link.
  3. CPU Affinity Binding: Use tools like numactl to bind each process to the CPU cores and memory of the NUMA node listed in the NUMA Affinity column for its GPU, so that host memory accesses and host-to-device transfers stay local to that node.
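
The binding described in the tips above can be sketched in Python as follows. This is a hedged example, not a prescribed method: the helper names are hypothetical, `train.py` is a placeholder command, and `os.sched_setaffinity` is Linux-only. Note that `os.sched_setaffinity` restricts CPU cores only; it sets no memory policy, which is what `numactl --membind` adds.

```python
import os
from typing import List, Set


def parse_cpu_list(spec: str) -> Set[int]:
    """Expand a CPU Affinity string such as '0-31' or '0-15,32-47' into core IDs."""
    cpus: Set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def numactl_command(numa_node: int, command: List[str]) -> List[str]:
    """Build a numactl invocation that binds CPUs and memory to one NUMA node."""
    return [
        "numactl",
        f"--cpunodebind={numa_node}",
        f"--membind={numa_node}",
        *command,
    ]


def pin_current_process(spec: str) -> Set[int]:
    """Pin the calling process to the listed cores that it is allowed to use."""
    allowed = parse_cpu_list(spec) & os.sched_getaffinity(0)
    if allowed:
        os.sched_setaffinity(0, allowed)
    return allowed


if __name__ == "__main__":
    # 'train.py' is a placeholder for the actual workload.
    print(" ".join(numactl_command(0, ["python", "train.py"])))
```

For the example matrix, where every GPU reports CPU Affinity 0-31 and NUMA Affinity 0, each worker process would be bound to node 0.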

Summary: By reading the hardware layout with nvidia-smi topo -m, you can make informed decisions about process placement in multi-GPU environments and get the most out of your hardware.