Enabling P2P Communication on NVIDIA RTX Servers: From Driver Patching to Performance Verification

A detailed guide on how to enable Peer-to-Peer (P2P) communication for RTX GPUs by modifying NVIDIA kernel modules on Debian/Ubuntu systems, and verifying bandwidth using CUDA Samples.

NVIDIA typically disables P2P (Peer-to-Peer) communication on consumer-grade RTX GPUs, reserving it for enterprise GPUs such as the A100 or H100. By patching the open kernel modules, we can force-enable the feature, which lets the GPUs in a multi-GPU server exchange data directly over PCIe instead of staging it through host memory.

This guide provides a step-by-step walkthrough based on Debian 12/13, covering everything from base driver installation to source code patching and final bandwidth verification.

1. Environment Setup and Basic Configuration

Before starting, ensure your system has the necessary build tools and that the CPU’s IOMMU support is enabled.

1.1 Install Essential Tools

sudo apt update
sudo apt install -y curl gnupg wget git

1.2 Enable IOMMU (DMA Passthrough)

P2P communication relies on correct IOMMU configuration. Edit the GRUB configuration file:

sudo nano /etc/default/grub

Locate the GRUB_CMDLINE_LINUX_DEFAULT line and add the parameters based on your CPU:

AMD CPU: amd_iommu=on iommu=pt
Intel CPU: intel_iommu=on iommu=pt
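
As an alternative to editing the file by hand, the same change can be scripted. The following is a minimal sketch, assuming the GRUB_CMDLINE_LINUX_DEFAULT value is enclosed in double quotes; add_iommu_params is a hypothetical helper name, and it is worth trying it on a copy of the file first.

```shell
# add_iommu_params is a hypothetical helper, not part of GRUB.
# It appends the IOMMU parameters to GRUB_CMDLINE_LINUX_DEFAULT in a file.
# Assumption: the value is enclosed in double quotes.
add_iommu_params() {
    file=$1      # path to a grub defaults file
    vendor=$2    # "amd" or "intel"
    params="${vendor}_iommu=on iommu=pt"
    # Do nothing if the vendor parameter is already present
    if grep -q "^GRUB_CMDLINE_LINUX_DEFAULT=.*${vendor}_iommu=on" "$file"; then
        return 0
    fi
    # Insert the parameters just before the closing double quote
    sed -i -E "s/^(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*)\"/\1 ${params}\"/" "$file"
}

# Example, run against a copy:
#   cp /etc/default/grub /tmp/grub.test
#   add_iommu_params /tmp/grub.test amd
#   grep GRUB_CMDLINE_LINUX_DEFAULT /tmp/grub.test
```

The early return keeps the helper safe to run twice without duplicating the parameters.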

Save and exit, then update GRUB and reboot:

sudo update-grub
sudo reboot
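
After the reboot, it can be useful to confirm that the parameters actually reached the kernel. This is a small sketch that inspects a kernel command line string; has_iommu_params is a hypothetical helper name introduced here for illustration.

```shell
# has_iommu_params is a hypothetical helper: it succeeds when a kernel
# command line string contains iommu=pt plus amd_iommu=on or intel_iommu=on.
has_iommu_params() {
    cmdline=" $1 "    # pad with spaces so whole words can be matched
    case $cmdline in
        *" iommu=pt "*) ;;
        *) return 1 ;;
    esac
    case $cmdline in
        *" amd_iommu=on "*|*" intel_iommu=on "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example, after rebooting:
#   has_iommu_params "$(cat /proc/cmdline)" && echo "IOMMU parameters active"
```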

1.3 Install Build Dependencies

sudo apt install -y dkms build-essential gcc make linux-headers-$(uname -r) pkg-config libglvnd-dev
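
DKMS builds against the headers of the running kernel, so a quick check that they are in place can save a failed build later. A minimal sketch follows; headers_present is a hypothetical helper, and its optional second argument (a root prefix) exists only to make it testable.

```shell
# headers_present is a hypothetical helper: it checks that the kernel build
# directory exists for a given kernel release.
# The optional second argument is a root prefix, used only for testing.
headers_present() {
    release=$1
    root=${2:-}
    # /lib/modules/<release>/build normally links to the installed headers
    [ -d "$root/lib/modules/$release/build" ]
}

# Example:
#   headers_present "$(uname -r)" || echo "headers missing for $(uname -r)"
```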

2. Installing the NVIDIA Driver Base

To ensure the patch is applied correctly, we need a driver base version that exactly matches the patched source (using 595.71.05 as an example).

2.1 Configure NVIDIA Repository

Download and install the Keyring package:

wget https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb

Note (Debian 13 Users): If you encounter SHA1 restriction errors, edit /etc/apt/sources.list.d/cuda-debian12-x86_64.list and change [arch=amd64] to [trusted=yes]. Be aware that trusted=yes disables signature verification for that repository.

2.2 Install Drivers

sudo apt update
sudo apt install nvidia-driver-pinning-595.71.05 \
                 nvidia-open=595.71.05-1 \
                 nvidia-kernel-open-dkms=595.71.05-1
sudo reboot
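
Since the patched source must match the installed driver exactly, it helps to confirm the running version after the reboot. The sketch below compares each line of version output against the expected value; driver_version_matches is a hypothetical helper name.

```shell
# driver_version_matches is a hypothetical helper: it reads one version per
# line on stdin and succeeds only if every non-empty line equals the
# expected version (and at least one line was seen).
driver_version_matches() {
    awk -v want="$1" '
        {
            gsub(/[ \t\r]/, "")        # drop stray whitespace
            if ($0 == "") next
            seen = 1
            if ($0 != want) bad = 1
        }
        END { exit (seen && !bad) ? 0 : 1 }
    '
}

# Example:
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader \
#       | driver_version_matches 595.71.05 && echo "driver version OK"
```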

3. Kernel Module Patching (The Core Step)

This is the crucial part. We will replace the official kernel source with a patched version that supports P2P on RTX GPUs.

3.1 Preparation

First, switch to the multi-user target (CLI mode) so the GUI does not keep the driver modules in use, then stop the persistence daemon and unload the modules:

sudo systemctl isolate multi-user.target
sudo systemctl stop nvidia-persistenced
sudo rmmod nvidia_drm nvidia_modeset nvidia_uvm nvidia
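
Before replacing the source, it is worth checking that no NVIDIA module is still loaded. A minimal sketch, where loaded_nvidia_modules is a hypothetical helper that reads the module listing:

```shell
# loaded_nvidia_modules is a hypothetical helper: it prints the names of
# loaded modules that start with "nvidia". The optional argument is the
# module listing to read (defaults to /proc/modules).
loaded_nvidia_modules() {
    awk '$1 ~ /^nvidia/ { print $1 }' "${1:-/proc/modules}"
}

# Example:
#   [ -z "$(loaded_nvidia_modules)" ] && echo "all NVIDIA modules unloaded"
```

If the helper still prints a name, something is holding the module open and the rmmod step needs to be repeated after stopping that process.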

3.2 Inject Patched Source Code

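Clone the repository containing the P2P-enabled kernel modules. If its default branch does not correspond to your installed driver version (595.71.05 in this example), check out the matching branch or tag after cloning: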

cd ~
git clone https://github.com/aikitoria/open-gpu-kernel-modules.git
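
To check that the cloned tree corresponds to the installed driver, one option is to read the version it declares. This sketch assumes the tree contains a version.mk file with a line of the form "NVIDIA_VERSION = <version>"; source_tree_version is a hypothetical helper name.

```shell
# source_tree_version is a hypothetical helper. Assumption: the source tree
# has a version.mk file containing a line "NVIDIA_VERSION = <version>".
source_tree_version() {
    sed -n 's/^NVIDIA_VERSION[[:space:]]*=[[:space:]]*//p' "$1/version.mk" | head -n 1
}

# Example:
#   [ "$(source_tree_version ~/open-gpu-kernel-modules)" = "595.71.05" ] \
#       || echo "version mismatch: check out the matching branch or tag first"
```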

Backup and replace the official source:

cd /usr/src/nvidia-595.71.05/
sudo mkdir -p backup
sudo mv kernel-open src backup/ # Adjust based on actual directory structure

cd ~/open-gpu-kernel-modules/
sudo cp -r kernel-open src /usr/src/nvidia-595.71.05/

3.3 Recompile using DKMS

Force DKMS to rebuild and reinstall the modules from the replaced source:

sudo dkms build -m nvidia -v 595.71.05 --force
sudo dkms install -m nvidia -v 595.71.05 --force
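
The result can be confirmed from the output of dkms status. The sketch below is hedged: it assumes each status line starts with the module name, contains the version, and ends in ": installed" once the module is in place, which covers the two line formats I am aware of. dkms_installed is a hypothetical helper name.

```shell
# dkms_installed is a hypothetical helper: it reads `dkms status` output on
# stdin and succeeds if a line for the given module and version reports
# ": installed". Assumption: lines begin with the module name.
dkms_installed() {
    module=$1
    version=$2
    awk -v m="$module" -v v="$version" '
        index($0, m) == 1 && index($0, v) > 0 && /: installed/ { found = 1 }
        END { exit found ? 0 : 1 }
    '
}

# Example:
#   dkms status | dkms_installed nvidia 595.71.05 && echo "patched modules installed"
```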

3.4 Load and Verify

sudo modprobe nvidia
nvidia-smi topo -p2p r

If the output shows the P2P status as OK, the patching was successful!
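
On machines with many GPUs the matrix is tedious to read by eye, so the check can be scripted. This is only a sketch: it assumes the output is a matrix whose rows start with a GPU<n> label and whose cells are X for the GPU itself and OK for a working pair. p2p_all_ok is a hypothetical helper name.

```shell
# p2p_all_ok is a hypothetical helper: it reads `nvidia-smi topo -p2p r`
# output on stdin and succeeds if every cell in the matrix is X or OK.
# Assumption: matrix rows start with a "GPU<n>" label; other lines
# (such as the legend) are ignored.
p2p_all_ok() {
    awk '
        $1 ~ /^GPU[0-9]+$/ {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^GPU[0-9]+$/) continue   # column header row
                rows = 1
                if ($i != "X" && $i != "OK") bad = 1
            }
        }
        END { exit (rows && !bad) ? 0 : 1 }
    '
}

# Example:
#   nvidia-smi topo -p2p r | p2p_all_ok && echo "P2P OK between all GPUs"
```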

4. CUDA Performance Verification

To quantify the effect of P2P, we use the official NVIDIA p2pBandwidthLatencyTest.

4.1 Install CUDA

sudo apt-get update
sudo apt-get -y install cuda

4.2 Compile P2P Test Tool

cd ~
git clone https://github.com/NVIDIA/cuda-samples.git
cd ~/cuda-samples/cpp/5_Domain_Specific/p2pBandwidthLatencyTest
mkdir build && cd build

4.3 Fixing CMake Path Issues (Pro Tip)

With some CMake versions, the build may fail to locate the CUDA paths even after CUDA is installed. We can “trick” CMake by creating a dummy directory structure:

sudo mkdir -p /usr/local/cuda/targets/x86_64-linux/include
sudo mkdir -p /usr/local/cuda/targets/x86_64-linux/lib

# Symlink actual system directories to the dummy structure
sudo ln -s /usr/include /usr/local/cuda/targets/x86_64-linux/include
sudo ln -s /usr/lib/x86_64-linux-gnu /usr/local/cuda/targets/x86_64-linux/lib

Run the compilation:

cmake -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc -DCMAKE_CXX_STANDARD=17 ..
make -j$(nproc)
./p2pBandwidthLatencyTest
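
Besides the bandwidth matrices, the test reports which device pairs have peer access. The following sketch counts the pairs that were refused in a saved log; it assumes the tool prints lines of the form "Device=0 CANNOT Access Peer Device=1" for such pairs, and count_blocked_pairs is a hypothetical helper name.

```shell
# count_blocked_pairs is a hypothetical helper: it counts the lines in a
# saved test log that report a refused peer connection.
# Assumption: such lines contain the text "CANNOT Access Peer".
count_blocked_pairs() {
    awk '/CANNOT Access Peer/ { n++ } END { print n + 0 }' "$1"
}

# Example:
#   ./p2pBandwidthLatencyTest | tee p2p.log
#   [ "$(count_blocked_pairs p2p.log)" -eq 0 ] && echo "all pairs have peer access"
```

A count of zero is what a successful patch should produce; the P2P-enabled bandwidth figures in the same log then show the actual gain over the P2P-disabled ones.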

5. Troubleshooting and FAQ

Environment Variable Configuration

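If CUDA is not found when building or running the tool, add the following to your ~/.bashrc, adjusting the version in CUDA_HOME to the one installed on your system, then open a new shell or run source ~/.bashrc: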

export CUDA_HOME=/usr/local/cuda-13.2
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/targets/x86_64-linux/lib:$CUDA_HOME/lib64:$LD_LIBRARY_PATH
export CUDACXX=$CUDA_HOME/bin/nvcc
export CUDA_PATH=$CUDA_HOME/targets/x86_64-linux
export CUDAToolkit_ROOT=$CUDA_HOME/targets/x86_64-linux
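
A quick sanity check on these variables is to confirm that CUDA_HOME really points at a toolkit. A minimal sketch, where cuda_home_ok is a hypothetical helper name:

```shell
# cuda_home_ok is a hypothetical helper: it succeeds when the given
# directory contains an executable bin/nvcc.
cuda_home_ok() {
    [ -n "$1" ] && [ -x "$1/bin/nvcc" ]
}

# Example, in a shell that has loaded the exports:
#   cuda_home_ok "$CUDA_HOME" || echo "CUDA_HOME does not point at a CUDA install"
```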

Restore Graphical Interface

After completing the tests, you can return to GUI mode:

sudo systemctl isolate graphical.target