Enabling P2P Communication on NVIDIA RTX Servers: From Driver Patching to Performance Verification
A detailed guide on how to enable Peer-to-Peer (P2P) communication for RTX GPUs by modifying NVIDIA kernel modules on Debian/Ubuntu systems, and verifying bandwidth using CUDA Samples.
When building multi-GPU servers, NVIDIA typically disables P2P (Peer-to-Peer) communication on consumer-grade RTX GPUs, reserving it for enterprise GPUs like the A100 or H100. However, by patching the kernel modules, we can force-enable this feature so that GPUs exchange data directly over PCIe instead of staging it through host memory.
This guide provides a step-by-step walkthrough based on Debian 12/13, covering everything from base driver installation to source code patching and final bandwidth verification.
1. Environment Setup and Basic Configuration
Before starting, ensure your system has the necessary build tools and that the CPU’s IOMMU support is enabled.
1.1 Install Essential Tools
sudo apt update
sudo apt install -y curl gnupg wget git
1.2 Enable IOMMU (DMA Passthrough)
P2P communication relies on correct IOMMU configuration. Edit the GRUB configuration file:
sudo nano /etc/default/grub
Locate the GRUB_CMDLINE_LINUX_DEFAULT line and add the parameters based on your CPU:
- AMD CPU: amd_iommu=on iommu=pt
- Intel CPU: intel_iommu=on iommu=pt
Save and exit, then update GRUB and reboot:
sudo update-grub
sudo reboot
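After the reboot, it is worth confirming that the parameters actually reached the kernel. The following is a minimal sketch: the helper name check_iommu_cmdline is hypothetical, and it only checks for the exact parameter strings listed in this section.

```shell
# Check whether a kernel command line contains the IOMMU parameters from this guide.
# check_iommu_cmdline is a hypothetical helper name, not part of any package.
check_iommu_cmdline() {
    # Pad with spaces so each parameter is matched as a whole word.
    local cmdline=" $1 "
    # This guide adds iommu=pt for both CPU vendors.
    case "$cmdline" in
        *" iommu=pt "*) ;;
        *) echo "missing iommu=pt"; return 1 ;;
    esac
    # One of the vendor-specific switches must be present as well.
    case "$cmdline" in
        *" amd_iommu=on "*|*" intel_iommu=on "*)
            echo "IOMMU parameters present"; return 0 ;;
        *)
            echo "missing amd_iommu=on or intel_iommu=on"; return 1 ;;
    esac
}

# Usage on the running system:
#   check_iommu_cmdline "$(cat /proc/cmdline)"
```

If the check reports a missing parameter, re-open /etc/default/grub and make sure update-grub was run before rebooting.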
1.3 Install Build Dependencies
sudo apt install -y dkms build-essential gcc make linux-headers-$(uname -r) pkg-config libglvnd-dev
2. Installing the NVIDIA Driver Base
For the patch to apply correctly, the installed driver version must exactly match the version of the patched source (this guide uses 595.71.05 as an example).
2.1 Configure NVIDIA Repository
Download and install the Keyring package:
wget https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
Note (Debian 13 Users): If you encounter SHA1 restriction errors, edit /etc/apt/sources.list.d/cuda-debian12-x86_64.list and change [arch=amd64] to [trusted=yes].
2.2 Install Drivers
sudo apt update
sudo apt install nvidia-driver-pinning-595.71.05 \
nvidia-open=595.71.05-1 \
nvidia-kernel-open-dkms=595.71.05-1
sudo reboot
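Because the patched source must match the installed driver exactly, a quick version check after the reboot can save a failed build later. This is a sketch under the assumption that the nvidia module is installed for the running kernel; the function names and the EXPECTED_VERSION variable are hypothetical.

```shell
# Version this guide uses as its example; change it if you target another release.
EXPECTED_VERSION="595.71.05"

# Print the version field of the installed nvidia kernel module.
# Prints nothing if the module is not installed for the running kernel.
installed_nvidia_version() {
    modinfo -F version nvidia 2>/dev/null
}

# Succeed only when the given version string equals EXPECTED_VERSION.
version_matches() {
    [ "$1" = "$EXPECTED_VERSION" ]
}

# Usage:
#   if version_matches "$(installed_nvidia_version)"; then
#       echo "driver matches $EXPECTED_VERSION"
#   else
#       echo "driver version mismatch"
#   fi
```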
3. Kernel Module Patching (The Core Step)
This is the crucial part. We will replace the official kernel source with a patched version that supports P2P on RTX GPUs.
3.1 Preparation
First, switch to the multi-user target (CLI mode) so the GUI does not keep the driver modules in use, then stop the persistence daemon and unload the modules:
sudo systemctl isolate multi-user.target
sudo systemctl stop nvidia-persistenced
sudo rmmod nvidia_drm nvidia_modeset nvidia_uvm nvidia
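Before replacing the source, it helps to confirm that no nvidia module is still loaded. The sketch below filters lsmod output; the helper name remaining_nvidia_modules is hypothetical.

```shell
# Read lsmod-style output on stdin and print every module whose name starts with "nvidia".
# Exit status is 0 when none remain and 1 when at least one is still loaded.
remaining_nvidia_modules() {
    awk '$1 ~ /^nvidia/ { print $1; found = 1 } END { exit (found ? 1 : 0) }'
}

# Usage:
#   lsmod | remaining_nvidia_modules && echo "all nvidia modules unloaded"
```

If any module is still listed, another process is probably holding the GPU; stop it and repeat the rmmod command.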
3.2 Inject Patched Source Code
Clone the repository containing the P2P-enabled kernel modules:
cd ~
git clone https://github.com/aikitoria/open-gpu-kernel-modules.git
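Since the clone must match the installed driver version, it may be worth checking which version the checked-out tree targets. This sketch assumes the fork keeps a top-level version.mk containing a line of the form NVIDIA_VERSION = x.y.z, as the upstream open-gpu-kernel-modules repository does; the helper name source_tree_version is hypothetical.

```shell
# Print the NVIDIA_VERSION value found in the version.mk file given as $1.
# Assumption: the file contains a line such as "NVIDIA_VERSION = 595.71.05".
source_tree_version() {
    sed -n 's/^NVIDIA_VERSION[[:space:]]*=[[:space:]]*\([^[:space:]]*\).*/\1/p' "$1" | head -n 1
}

# Usage:
#   v="$(source_tree_version ~/open-gpu-kernel-modules/version.mk)"
#   [ "$v" = "595.71.05" ] || echo "source tree targets $v, not 595.71.05"
```

If the versions differ, check out the branch or tag of the repository that corresponds to your driver before continuing.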
Backup and replace the official source:
cd /usr/src/nvidia-595.71.05/
sudo mkdir -p backup
sudo mv kernel-open src backup/ # Adjust based on actual directory structure
cd ~/open-gpu-kernel-modules/
sudo cp -r kernel-open src /usr/src/nvidia-595.71.05/
3.3 Recompile using DKMS
Force a rebuild and reinstall of the modules via DKMS:
sudo dkms build -m nvidia -v 595.71.05 --force
sudo dkms install -m nvidia -v 595.71.05 --force
3.4 Load and Verify
sudo modprobe nvidia
nvidia-smi topo -p2p r
If the output shows the P2P status as OK, the patching was successful!
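On machines with many GPUs the matrix is tedious to read by eye. The sketch below lists every GPU pair whose status is not OK. It assumes the usual matrix layout (a header row of GPU names followed by one row per GPU, with X on the diagonal); the helper name p2p_failures is hypothetical.

```shell
# Read `nvidia-smi topo -p2p r` output on stdin and print GPU pairs whose status is not OK.
# Exit status is 0 when every pair is OK and 1 otherwise.
p2p_failures() {
    awk '
        # Header row: the first two fields are both GPU names; remember the column labels.
        $1 ~ /^GPU[0-9]+$/ && $2 ~ /^GPU[0-9]+$/ {
            for (i = 1; i <= NF; i++) col[i] = $i
            next
        }
        # Data row: the first field is the GPU name, the remaining fields are status codes.
        $1 ~ /^GPU[0-9]+$/ {
            for (i = 2; i <= NF; i++)
                if ($i != "X" && $i != "OK") {
                    print $1 " -> " col[i - 1] ": " $i
                    bad = 1
                }
        }
        END { exit (bad ? 1 : 0) }
    '
}

# Usage:
#   nvidia-smi topo -p2p r | p2p_failures && echo "P2P OK on all pairs"
```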
4. CUDA Performance Verification
To quantify the effect of P2P, we use the official NVIDIA p2pBandwidthLatencyTest.
4.1 Install CUDA
sudo apt-get update
sudo apt-get -y install cuda
4.2 Compile P2P Test Tool
git clone https://github.com/NVIDIA/cuda-samples.git ~/cuda-samples
cd ~/cuda-samples/cpp/5_Domain_Specific/p2pBandwidthLatencyTest
mkdir build && cd build
4.3 Fixing CMake Path Issues (Pro Tip)
In some CMake versions, the tool may fail to locate CUDA paths even after installation. We can “trick” CMake by creating a dummy directory structure:
# Create only the parent directory; if include/ and lib/ already existed as
# directories, ln would place the links inside them instead of replacing them.
sudo mkdir -p /usr/local/cuda/targets/x86_64-linux
# Symlink actual system directories to the dummy structure
sudo ln -s /usr/include /usr/local/cuda/targets/x86_64-linux/include
sudo ln -s /usr/lib/x86_64-linux-gnu /usr/local/cuda/targets/x86_64-linux/lib
Run the compilation:
cmake -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc -DCMAKE_CXX_STANDARD=17 ..
make -j$(nproc)
./p2pBandwidthLatencyTest
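The test prints a peer-access line for each device pair before the bandwidth matrices. A quick way to summarize them is sketched below, assuming the lines read "Device=0 CAN Access Peer Device=1" or "Device=0 CANNOT Access Peer Device=1"; the helper name count_blocked_pairs is hypothetical.

```shell
# Count the device pairs reported as "CANNOT Access Peer" in the test output read from stdin.
# grep -c exits non-zero when the count is 0, so "|| true" keeps the function status at 0.
count_blocked_pairs() {
    grep -c 'CANNOT Access Peer' || true
}

# Usage (keeps the full log in p2p.log for later comparison):
#   ./p2pBandwidthLatencyTest | tee p2p.log | count_blocked_pairs
```

A count of 0 means every pair reports peer access; compare the P2P=Enabled and P2P=Disabled bandwidth matrices in the log to quantify the gain.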
5. Troubleshooting and FAQ
Environment Variable Configuration
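If CUDA is not found when building or running the tool, add the following to your ~/.bashrc (adjusting the version in the path to your installation), then open a new shell or run source ~/.bashrc: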
export CUDA_HOME=/usr/local/cuda-13.2
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/targets/x86_64-linux/lib:$CUDA_HOME/lib64:$LD_LIBRARY_PATH
export CUDACXX=$CUDA_HOME/bin/nvcc
export CUDA_PATH=$CUDA_HOME/targets/x86_64-linux
export CUDAToolkit_ROOT=$CUDA_HOME/targets/x86_64-linux
Restore Graphical Interface
After completing the tests, you can return to GUI mode:
sudo systemctl isolate graphical.target