Understanding and Analyzing NVIDIA GPU Topology in Linux

A comprehensive guide on using nvidia-smi to inspect GPU topology and a deep dive into the meaning of topology identifiers (NODE, SYS, PHB, etc.) to optimize multi-GPU communication.

When training large-scale deep learning models or deploying multi-GPU inference, the communication efficiency between GPUs directly impacts overall performance. Understanding the GPU topology helps you optimize process binding (CPU affinity) and data transfer paths, avoiding unnecessary PCIe bottlenecks.

🚀 How to View GPU Topology?

In a Linux environment with NVIDIA drivers installed, you can use the following command to view the detailed GPU topology matrix:

nvidia-smi topo -m

This command outputs a matrix showing the connection type between every pair of GPUs and their affinity with the CPU.
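
For scripted use, the matrix can be read into a dictionary keyed by GPU pair. The following is a minimal sketch, not an official API: the names parse_topo_matrix and read_topo_matrix are hypothetical, and it assumes the first non-empty line is the header, that the GPU columns come first in that header, and that each GPU row starts with its name (GPU0, GPU1, ...).

```python
import re
import subprocess


def parse_topo_matrix(text):
    """Return {(gpu_a, gpu_b): link_type} from `nvidia-smi topo -m` text."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return {}
    # Header: GPU column names are the tokens shaped like "GPU<number>".
    columns = re.findall(r"GPU\d+", lines[0])
    links = {}
    for line in lines[1:]:
        tokens = line.split()
        row = tokens[0]
        if not re.fullmatch(r"GPU\d+", row):
            continue  # skip non-GPU rows, the legend, and anything else
        # Assumption: the first len(columns) cells of a row are the GPU links.
        for col, link in zip(columns, tokens[1:1 + len(columns)]):
            if col != row:  # drop the "X" self entry
                links[(row, col)] = link
    return links


def read_topo_matrix():
    """Run nvidia-smi and parse its output; requires the NVIDIA driver."""
    result = subprocess.run(
        ["nvidia-smi", "topo", "-m"],
        capture_output=True, text=True, check=True,
    )
    return parse_topo_matrix(result.stdout)
```

Keeping the parser separate from the subprocess call means it can be tried on saved output from another machine, without a GPU present.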

📊 Topology Matrix Example Analysis

Suppose we see the following output:

	GPU0	GPU1	GPU2	CPU Affinity	GPU NUMA ID
GPU0	X	NODE	NODE	0-31	N/A
GPU1	NODE	X	NODE	0-31	N/A
GPU2	NODE	NODE	X	0-31	N/A

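Analysis: In this example, every connection between GPU0, GPU1, and GPU2 is marked NODE. The three cards sit within the same NUMA node but under different PCIe Host Bridges, so traffic between them crosses the interconnect between those bridges. This avoids the cross-socket penalty of SYS, but it is slower than a PHB, PXB, PIX, or NVLink path. All three GPUs also share CPU affinity 0-31, so any of those cores is local to each card.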

📖 Deep Dive into Topology Identifiers (Legend)

Understanding these abbreviations is key to performance tuning. Here is the detailed explanation:

Identifier	Meaning	Detailed Explanation
X	Self	The GPU itself.
NV#	NVLink	Highest Speed. Connection via a bonded set of NVLinks, where # is the number of links (e.g., NV2). It bypasses PCIe for maximum bandwidth and minimum latency.
PIX	PCIe Single Bridge	High Speed. Connection traverses at most a single PCIe bridge; a very short path.
PXB	PCIe Multiple Bridges	Medium Speed. Connection traverses multiple PCIe bridges but does not reach the PCIe Host Bridge.
PHB	PCIe Host Bridge	Medium Speed. Connection must traverse a PCIe Host Bridge (typically the CPU’s internal PCIe controller).
NODE	NUMA Node	Medium/Low Speed. Connection traverses PCIe as well as the interconnect between PCIe Host Bridges within a single NUMA node.
SYS	System	Slowest. Connection traverses PCIe and the SMP interconnect (e.g., Intel UPI or AMD Infinity Fabric) between different NUMA nodes.
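
The ordering in the legend above can be turned into a sort key for choosing which GPUs to pair. This is an illustrative sketch: the names link_rank and closest_pairs are hypothetical, and treating a higher NV# count as faster is an assumption based on more bonded links giving more bandwidth.

```python
# PCIe path types from the legend, fastest first. NV# is handled separately.
PCIE_ORDER = ["PIX", "PXB", "PHB", "NODE", "SYS"]


def link_rank(link):
    """Return a sortable key for a link type; a smaller key means a faster path."""
    if link.startswith("NV"):
        # Any NVLink outranks every PCIe path; more bonded links sort first.
        return (0, -int(link[2:]))
    return (1 + PCIE_ORDER.index(link), 0)


def closest_pairs(links):
    """Sort GPU pairs from fastest to slowest link.

    `links` maps (gpu_a, gpu_b) tuples to identifiers such as "NV2" or "SYS".
    """
    return sorted(links, key=lambda pair: (link_rank(links[pair]), pair))
```

For example, given one pair linked by NV2, one by PHB, and one by SYS, closest_pairs lists the NV2 pair first and the SYS pair last, which is the order in which you would want to co-locate tightly coupled workers.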

💡 Performance Optimization Tips

Prioritize NVLink: If the topology shows NV#, it is the fastest path and should be the first choice for distributed training.
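Avoid SYS Cross-Node Communication: If you see SYS, data must cross CPU sockets over the SMP interconnect, which is the slowest path in the matrix. Where possible, place processes that communicate heavily on GPUs that are not connected by a SYS link, and bind each process to CPU cores on the same NUMA node as its GPU.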
CPU Affinity Binding: Use tools like numactl to bind each process to the cores listed in its GPU's CPU Affinity column, and to that NUMA node's memory, to reduce host memory access latency.
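
The binding tip can be sketched in Python in two ways: pinning the current process with os.sched_setaffinity (Linux only), or building a numactl command line for a child process. The helper names parse_cpu_list, pin_to_gpu_cores, and numactl_command are hypothetical, and train.py in the usage note is a placeholder.

```python
import os


def parse_cpu_list(spec):
    """Expand a CPU list such as "0-15,32-47" into a set of core indices."""
    cpus = set()
    for part in spec.split(","):
        lo, _, hi = part.strip().partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus


def pin_to_gpu_cores(cpu_affinity, gpu_index):
    """Pin this process to a GPU's local cores and expose only that GPU."""
    # Restricts CPU scheduling only; it does not bind memory allocations.
    os.sched_setaffinity(0, parse_cpu_list(cpu_affinity))
    # Must be set before the CUDA runtime is initialized in this process.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_index)


def numactl_command(numa_node, argv):
    """Build an argument list that runs argv bound to one NUMA node."""
    return [
        "numactl",
        f"--cpunodebind={numa_node}",
        f"--membind={numa_node}",
        *argv,
    ]
```

With the sample matrix, pin_to_gpu_cores("0-31", 0) would keep the process on cores 0 to 31 and make only GPU0 visible. The list returned by numactl_command(0, ["python", "train.py"]) can be passed to subprocess.run; unlike the in-process pin, it also binds memory allocation to the chosen node.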

Summary: Reading the hardware layout from nvidia-smi topo -m lets you make informed decisions about process placement in multi-GPU environments and get the most out of your hardware.