Compiling llama.cpp on Linux: Full Guide from CPU to CUDA Acceleration
A detailed guide on how to compile llama.cpp from source on Linux, covering basic CPU versions and NVIDIA GPU (CUDA) acceleration configuration steps. Includes complete compilation command reference.
For developers looking to run Large Language Models (LLMs) locally, llama.cpp is one of the most widely used open-source inference frameworks. Through heavily optimized C/C++ code and quantization techniques, it allows models that once required expensive GPUs to run on ordinary desktops and even laptops.
The main product of this project is the llama library, whose C-style API can be found in the include/llama.h file. The project also includes a large number of example programs and tools, all developed based on the llama library, ranging from simple code snippets to more complex sub-projects such as an OpenAI-compatible HTTP server.
This article will guide you through the process of compiling llama.cpp from source on a Linux environment to achieve the best hardware adaptation performance, and provides a complete compilation command reference at the end.
Prerequisites
Before starting the compilation, please ensure your system has the following basic tools installed:
- Basic Build Tools: Git (for cloning the source code), CMake (for building the project), and a C/C++ compiler such as GCC or Clang.
- Drivers & Toolkit (GPU users only): If you are using an NVIDIA GPU, you must have a recent NVIDIA driver and the CUDA Toolkit installed.
- Environment Preparation: It is recommended to perform these operations in a dedicated folder.
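Before going further, it can help to confirm that the tools are actually on your PATH. The following is a minimal sketch: check_tools is a hypothetical helper name, not part of llama.cpp, and it only checks that each command exists, not which version is installed.

```shell
# check_tools is a hypothetical helper, not part of llama.cpp.
# Prints one "missing: NAME" line per command not found on PATH;
# returns 0 if every command was found, 1 otherwise.
check_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            rc=1
        fi
    done
    return "$rc"
}

# CPU-only builds need git and cmake; for CUDA builds, add nvcc to the list.
check_tools git cmake || echo "install the tools listed above before building"
```

For a CUDA build you would call it as check_tools git cmake nvcc instead.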
Fetch the Source Code
First, clone the latest codebase from the official GitHub repository:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
Quick Compilation Guide
1. Compiling the CPU Version (Generic)
If you don’t have a dedicated GPU or only need to run models on the CPU, you can use the simplest compilation method. It has the broadest compatibility and works on both x86 and ARM machines.
# Create build directory and configure
cmake -B build
# Execute build (It's recommended to put your CPU thread count after -j, e.g., -j 8)
cmake --build build --config Release -j 8
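Rather than hard-coding the thread count, you can derive it from the machine. This is a minimal sketch that assumes nproc (GNU coreutils) or getconf is available; build_jobs is a hypothetical helper name.

```shell
# build_jobs is a hypothetical helper: prints the number of online CPUs,
# falling back to 1 if neither nproc nor getconf can report it.
build_jobs() {
    if command -v nproc >/dev/null 2>&1; then
        nproc
    else
        getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1
    fi
}

# Print the build command with the detected job count.
echo "cmake --build build --config Release -j $(build_jobs)"
```

Using every core is a reasonable default for CPU builds; heavier builds may need a lower number if memory runs short.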
2. Compiling the GPU Version (NVIDIA CUDA Acceleration): Default Recommended
For much faster inference, it is strongly recommended to enable CUDA support. This lets the model run from VRAM on the GPU, which is typically many times faster than CPU inference; the exact speedup depends on the model and the hardware.
Note: Please ensure that the nvcc command is available (meaning CUDA Toolkit is correctly installed).
# Configure CUDA acceleration parameters
# GGML_CUDA=ON: Enable CUDA support
# GGML_CUDA_NCCL=ON: Enable NCCL (NVIDIA Collective Communications Library) for multi-GPU communication
# Note: GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 is a runtime environment variable, not a CMake option.
# Set it when launching the program (see the Unified Memory section) to allow use of system memory when VRAM is insufficient.
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_NCCL=ON
# Execute build
cmake --build build --config Release -j 8
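If the same script has to work on machines with and without the CUDA Toolkit, one option is to pick the flag based on whether nvcc is found. This is only a sketch: cuda_flag is a hypothetical helper, and its optional argument exists purely so the probe command can be swapped out.

```shell
# cuda_flag is a hypothetical helper, not part of llama.cpp.
# Prints -DGGML_CUDA=ON when the probe command (default: nvcc) is on PATH,
# and -DGGML_CUDA=OFF otherwise.
cuda_flag() {
    if command -v "${1:-nvcc}" >/dev/null 2>&1; then
        echo "-DGGML_CUDA=ON"
    else
        echo "-DGGML_CUDA=OFF"
    fi
}

# Show the configure command that would be used on this machine.
echo "cmake -B build $(cuda_flag)"
```

Finding nvcc on PATH does not guarantee a working driver, so a failed configure step is still possible.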
llama.cpp Compilation Command Reference
This chapter summarizes all compilation methods for llama.cpp, covering various backends and platforms.
CPU Build
Basic CPU Build
Build llama.cpp using CMake:
cmake -B build
cmake --build build --config Release
Notes:
- To speed up compilation, add the -j parameter for multi-threaded parallel compilation. For example, cmake --build build --config Release -j 8 runs 8 tasks concurrently. Alternatively, use a generator that builds in parallel by default (like Ninja).
- It is recommended to install the ccache tool to speed up repeated compilations.
Debug Build
Single-configuration generators (like the default Unix Makefiles, which ignore the --config parameter):
cmake -B build -DCMAKE_BUILD_TYPE=Debug
cmake --build build
Multi-configuration generators (like using -G "Xcode" or Visual Studio):
cmake -B build -G "Xcode"
cmake --build build --config Debug
Static Library Build
For a static library version, add the parameter -DBUILD_SHARED_LIBS=OFF:
cmake -B build -DBUILD_SHARED_LIBS=OFF
cmake --build build --config Release
Windows Build
Please install Visual Studio 2022 (e.g., Community Edition) with the following features:
- Workload: Desktop development with C++
- Components: C++ CMake tools, Git for Windows, C++ Clang compiler, MSBuild support for LLVM toolchain (clang)
On Windows on ARM (arm64, WoA):
cmake --preset arm64-windows-llvm-release -D GGML_OPENMP=OFF
cmake --build build-arm64-windows-llvm-release
To build with Ninja as the generator and clang as the default compiler, set the environment variables first:
set LIB=C:\Program Files (x86)\Windows Kits\10\Lib\10.0.22621.0\um\x64;C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.41.34120\lib\x64\uwp;C:\Program Files (x86)\Windows Kits\10\Lib\10.0.22621.0\ucrt\x64
cmake --preset x64-windows-llvm-release
cmake --build build-x64-windows-llvm-release
Curl (libcurl client library) is enabled by default. Disable it with -DLLAMA_CURL=OFF if not needed.
BLAS Build
Enabling BLAS (Basic Linear Algebra Subprograms) can improve prompt-processing speed when the batch size is larger than 32 (the default is 512). BLAS does not affect text generation performance.
Accelerate Framework
Only available on Mac and enabled by default. Follow the standard build process.
OpenBLAS
Pure CPU BLAS acceleration, requires OpenBLAS pre-installed on your machine.
cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
cmake --build build --config Release
Intel oneMKL
Compile with oneAPI compiler to enable avx_vnni instruction set on Intel processors that don’t support avx512.
After manually installing oneAPI:
source /opt/intel/oneapi/setvars.sh
cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=Intel10_64lp -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DGGML_NATIVE=ON
cmake --build build --config Release
The default GGML_BLAS_VENDOR parameter is Generic. If you have loaded the Intel environment script and set -DGGML_BLAS=ON, the MKL version of BLAS is selected automatically.
For other BLAS libraries, specify via GGML_BLAS_VENDOR. See CMake Documentation for supported libraries.
Metal Build
On macOS, Metal (Apple’s custom graphics and compute API) is enabled by default, allowing computation tasks to be offloaded to the GPU. To disable Metal at compile time, add -DGGML_METAL=OFF.
If Metal support is enabled (default), you can explicitly disable GPU inference with the startup parameter --n-gpu-layers 0.
SYCL (Intel GPU)
SYCL is a high-level heterogeneous computing programming model that can improve cross-device (e.g., CPU, GPU, FPGA) development efficiency.
llama.cpp’s SYCL-based implementation supports Intel GPUs (such as the Data Center Max series, Flex series, Arc series, and integrated GPUs).
For detailed instructions, see llama.cpp for SYCL.
CUDA (NVIDIA GPU): Default Recommended
Accelerate using NVIDIA GPU. Please ensure CUDA Toolkit is installed.
Official Download
Please visit NVIDIA Developer Website to get the CUDA installer.
Compilation Command
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
Override CUDA Architecture (Compute Capability) Settings
If nvcc cannot detect your GPU, you may see the following warning:
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
In this case, you can manually specify the CUDA compute capability:
- Check the Compute Capability of your NVIDIA device, for example:
| GPU Model | Compute Capability |
|---|---|
| GeForce RTX 4090 | 8.9 |
| GeForce RTX 3080 Ti | 8.6 |
| GeForce RTX 3070 | 8.6 |
- Manually list the Compute Capabilities to support in the CMake command, separated by semicolons (;):
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="86;89"
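The value passed to CMAKE_CUDA_ARCHITECTURES is the compute capability with the dot removed, so 8.6 becomes 86. A small sketch of that conversion follows; cc_to_arch is a hypothetical helper name, not something llama.cpp or CMake provides.

```shell
# cc_to_arch is a hypothetical helper: converts compute capabilities such as
# "8.6 8.9" into the semicolon-separated form CMake expects ("86;89").
cc_to_arch() {
    out=""
    for cc in "$@"; do
        arch=$(printf '%s' "$cc" | tr -d '.')
        out="${out:+$out;}$arch"
    done
    printf '%s\n' "$out"
}

cc_to_arch 8.6 8.9   # prints 86;89

# Possible use in the configure step:
# cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="$(cc_to_arch 8.6 8.9)"
```

Listing only the architectures you need usually shortens the CUDA build, since fewer kernel variants are compiled.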
CUDA Runtime Environment Variables
You can set CUDA environment variables at runtime:
# Use CUDA_VISIBLE_DEVICES to hide the first compute device
CUDA_VISIBLE_DEVICES="-0" ./build/bin/llama-server --model /srv/models/llama.gguf
Unified Memory
On Linux, set the environment variable GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 to enable CUDA Unified Memory, which means when VRAM is exhausted, it will automatically use system memory (instead of crashing). On Windows, this setting can be achieved through the “System Memory Fallback” option in the NVIDIA Control Panel.
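Because this is an environment variable rather than a build option, it has to be present when the program starts. The sketch below wraps that in a function; with_unified_memory is a hypothetical name, and the demonstration line only echoes the variable from a child shell to show that it is passed through.

```shell
# with_unified_memory is a hypothetical wrapper: it runs the given command
# with GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 set for that process only.
with_unified_memory() {
    env GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 "$@"
}

# Demonstrate that the child process sees the variable.
with_unified_memory sh -c 'echo "GGML_CUDA_ENABLE_UNIFIED_MEMORY=$GGML_CUDA_ENABLE_UNIFIED_MEMORY"'

# Possible use with a real binary:
# with_unified_memory ./build/bin/llama-server --model /srv/models/llama.gguf
```

Setting the variable per command, instead of exporting it for the whole shell, keeps other runs on the default behavior.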
Performance Tuning Options
| Option | Values | Default | Description |
|---|---|---|---|
| GGML_CUDA_FORCE_MMQ | Boolean | false | Force use of the custom matrix multiplication kernels for quantized models even without int8 tensor cores. Applicable to V100, CDNA, and RDNA3+. Enabled by default for GPUs with int8 tensor cores. Reduces large-batch speed but lowers VRAM usage. |
| GGML_CUDA_FORCE_CUBLAS | Boolean | false | Force use of the FP16 cuBLAS library for quantized models instead of the custom matrix multiplication kernels. |
| GGML_CUDA_F16 | Boolean | false | Enable half-precision floating-point (FP16) arithmetic for the CUDA dequantization and multiplication kernels and for the q4_1/q5_1 matrix multiplication kernels. Can improve performance on newer GPUs. |
| GGML_CUDA_PEER_MAX_BATCH_SIZE | Positive Integer | 128 | Maximum batch size for which peer access (direct GPU-to-GPU memory access) is enabled in multi-GPU setups. Peer access requires either Linux or NVLink. |
| GGML_CUDA_FA_ALL_QUANTS | Boolean | false | Compile all KV cache quantization types for the FlashAttention CUDA kernels. Gives finer-grained control over the KV cache but takes longer to compile. |
MUSA (Moore Threads GPU)
MUSA provides acceleration support for Moore Threads GPUs. Please ensure MUSA SDK is installed.
Official Download
Visit Moore Threads Developer Center to download the MUSA SDK.
Compilation Command
cmake -B build -DGGML_MUSA=ON
cmake --build build --config Release
Override Compute Capability
All supported MUSA compute capabilities are enabled by default. To customize:
cmake -B build -DGGML_MUSA=ON -DMUSA_ARCHITECTURES="21"
cmake --build build --config Release
This configuration only enables devices with compute capability 2.1 (MTT S80), which can effectively reduce compilation time.
Static Library Compilation
cmake -B build -DGGML_MUSA=ON \
-DBUILD_SHARED_LIBS=OFF -DCMAKE_POSITION_INDEPENDENT_CODE=ON
cmake --build build --config Release
Runtime Environment Variables
MUSA_VISIBLE_DEVICES="-0" ./build/bin/llama-server --model /srv/models/llama.gguf
Unified Memory
Set the environment variable GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 to enable unified memory on Linux systems.
HIP (AMD GPU / ROCm)
HIP provides acceleration based on AMD GPUs (only supports devices with ROCm installed).
Please ensure ROCm driver is installed.
Linux Compilation (using AMD GPU with gfx1030 architecture as example)
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
cmake -S . -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1030 -DCMAKE_BUILD_TYPE=Release \
&& cmake --build build --config Release -- -j 16
If your hardware is RDNA3+ or CDNA architecture, add -DGGML_HIP_ROCWMMA_FATTN=ON to improve Flash Attention performance. Ensure rocWMMA library is installed.
Fix ROCm device library error:
If you see this error during compilation:
clang: error: cannot find ROCm device library; provide its path via '--rocm-path' or '--rocm-device-lib-path', or pass '-nogpulib' to build without ROCm device library
Find the directory containing oclc_abi_version_400.bc under HIP_PATH, then pass it as HIP_DEVICE_LIB_PATH:
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -p)" \
HIP_DEVICE_LIB_PATH=<your directory path> \
cmake -S . -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1030 -DCMAKE_BUILD_TYPE=Release \
&& cmake --build build -- -j 16
Windows Compilation
set PATH=%HIP_PATH%\bin;%PATH%
cmake -S . -B build -G Ninja -DAMDGPU_TARGETS=gfx1100 -DGGML_HIP=ON -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_BUILD_TYPE=Release
cmake --build build
Make sure AMDGPU_TARGETS is set to the GPU architecture you want to support. For example, gfx1100 applies to Radeon RX 7900XTX/XT/GRE.
Query your GPU model:
rocminfo | grep gfx | head -1 | awk '{print $2}'
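On a machine with more than one AMD GPU you may want every architecture rather than only the first. The sketch below rests on two assumptions: that rocminfo prints the gfx name as the second whitespace-separated field (as the awk command above implies), and that AMDGPU_TARGETS accepts a semicolon-separated list like other CMake list variables. gfx_targets is a hypothetical helper name.

```shell
# gfx_targets is a hypothetical helper: reads rocminfo-style output on stdin
# and prints the unique gfx names, joined with semicolons.
# Assumption: the gfx name is the second whitespace-separated field.
gfx_targets() {
    awk '$2 ~ /^gfx/ { print $2 }' | sort -u | paste -sd';' -
}

# Possible use (requires rocminfo, so it is left commented out):
# cmake -S . -B build -DGGML_HIP=ON -DAMDGPU_TARGETS="$(rocminfo | gfx_targets)"
```

Check the printed value before relying on it, since the output format of rocminfo can differ between ROCm releases.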
Unified Memory
On Linux systems, you can enable Unified Memory Architecture (UMA) by setting the environment variable GGML_CUDA_ENABLE_UNIFIED_MEMORY=1.
Vulkan
Windows Platform - w64devkit
- Download and extract w64devkit
- Download and install Vulkan SDK with default configuration
- Open w64devkit.exe and run the following commands to copy Vulkan dependency files:
SDK_VERSION=1.3.283.0
cp /VulkanSDK/$SDK_VERSION/Bin/glslc.exe $W64DEVKIT_HOME/bin/
cp /VulkanSDK/$SDK_VERSION/Lib/vulkan-1.lib $W64DEVKIT_HOME/x86_64-w64-mingw32/lib/
cp -r /VulkanSDK/$SDK_VERSION/Include/* $W64DEVKIT_HOME/x86_64-w64-mingw32/include/
cat > $W64DEVKIT_HOME/x86_64-w64-mingw32/lib/pkgconfig/vulkan.pc <<EOF
Name: Vulkan-Loader
Description: Vulkan Loader
Version: $SDK_VERSION
Libs: -lvulkan-1
EOF
- Switch to llama.cpp project directory and build with CMake:
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release
Git Bash MINGW64
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release
Load model using Vulkan backend:
build/bin/Release/llama-cli -m "[model path]" -ngl 100 -c 16384 -t 10 -n -2 -cnv
MSYS2
pacman -S git \
mingw-w64-ucrt-x86_64-gcc \
mingw-w64-ucrt-x86_64-cmake \
mingw-w64-ucrt-x86_64-vulkan-devel \
mingw-w64-ucrt-x86_64-shaderc
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release
Using Docker
# Build image
docker build -t llama-cpp-vulkan --target light -f .devops/vulkan.Dockerfile .
# Run
docker run -it --rm -v "$(pwd):/app:Z" --device /dev/dri/renderD128:/dev/dri/renderD128 --device /dev/dri/card1:/dev/dri/card1 llama-cpp-vulkan -m "/app/models/YOUR_MODEL_FILE" -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33
Non-Docker Environment (Ubuntu 22.04)
wget -qO - https://packages.lunarg.com/lunarg-signing-key-pub.asc | apt-key add -
wget -qO /etc/apt/sources.list.d/lunarg-vulkan-jammy.list https://packages.lunarg.com/vulkan/lunarg-vulkan-jammy.list
apt update -y
apt-get install -y vulkan-sdk
vulkaninfo
cmake -B build -DGGML_VULKAN=1
cmake --build build --config Release
# Test command
./build/bin/llama-cli -m "PATH_TO_MODEL" -p "Hi you how are you" -n 50 -e -ngl 33 -t 4
CANN (Huawei Ascend NPU)
CANN supports acceleration using Huawei Ascend NPU AI cores.
Please ensure CANN Toolkit is installed. Download from CANN Toolkit.
cmake -B build -DGGML_CANN=on -DCMAKE_BUILD_TYPE=release
cmake --build build --config release
# Test
./build/bin/llama-cli -m PATH_TO_MODEL -p "Building a website can be done in 10 steps:" -ngl 32
If the screen outputs the following information, it means CANN backend is successfully used:
llm_load_tensors: CANN model buffer size = 13313.00 MiB
llama_new_context_with_model: CANN compute buffer size = 1260.81 MiB
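If you save the program's log to a file, the same check can be scripted. The sketch below is not tied to CANN: uses_backend is a hypothetical helper that looks for a "buffer size" line naming whichever backend you pass in, and the demonstration reuses the first sample line shown above.

```shell
# uses_backend is a hypothetical helper: reads a log on stdin and succeeds
# if any line contains "<NAME> ... buffer size".
uses_backend() {
    grep -q "$1 .*buffer size"
}

# Check the first sample log line shown above.
if echo "llm_load_tensors: CANN model buffer size = 13313.00 MiB" | uses_backend CANN; then
    echo "CANN backend in use"
fi

# Possible use with a saved log (run.log is a placeholder file name):
# uses_backend CANN < run.log
```

The exact log wording may change between llama.cpp versions, so treat a failed match as a prompt to read the log rather than proof that the backend is unused.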
For more details on model/device support and CANN installation, please refer to llama.cpp for CANN.
Arm KleidiAI
KleidiAI is an AI microkernel library specifically optimized for Arm CPUs. It can improve AI workload performance under ARM architecture through microkernels.
cmake -B build -DGGML_CPU_KLEIDIAI=ON
cmake --build build --config Release
# Verify
./build/bin/llama-cli -m PATH_TO_MODEL -p "What is a car?"
If KleidiAI is enabled, the output will be similar to:
load_tensors: CPU_KLEIDIAI model buffer size = 3474.00 MiB
If the platform supports SME, you need to manually set the environment variable GGML_KLEIDIAI_SME=1 to enable the corresponding functionality.
Note: On some compilation targets, other higher-priority backends may be enabled by default. To force use of CPU backend, disable high-priority backends at compile time (e.g., -DGGML_METAL=OFF), or use --device none at runtime.
OpenCL
The OpenCL backend can achieve GPU acceleration through modern Adreno GPUs (Qualcomm chips).
Android Platform
Assuming NDK path is set to $ANDROID_NDK:
mkdir -p ~/dev/llm
cd ~/dev/llm
# Install OpenCL headers and ICD loader
git clone https://github.com/KhronosGroup/OpenCL-Headers && \
cd OpenCL-Headers && \
cp -r CL $ANDROID_NDK/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/include
cd ~/dev/llm
git clone https://github.com/KhronosGroup/OpenCL-ICD-Loader && \
cd OpenCL-ICD-Loader && \
mkdir build_ndk && cd build_ndk && \
cmake .. -G Ninja -DCMAKE_BUILD_TYPE=Release \
-DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
-DOPENCL_ICD_LOADER_HEADERS_DIR=$ANDROID_NDK/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/include \
-DANDROID_ABI=arm64-v8a \
-DANDROID_PLATFORM=24 \
-DANDROID_STL=c++_shared && \
ninja && \
cp libOpenCL.so $ANDROID_NDK/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/lib/aarch64-linux-android
# Enable OpenCL build for llama.cpp
cd ~/dev/llm
git clone https://github.com/ggml-org/llama.cpp && \
cd llama.cpp && \
mkdir build-android && cd build-android
cmake .. -G Ninja \
-DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
-DANDROID_ABI=arm64-v8a \
-DANDROID_PLATFORM=android-28 \
-DBUILD_SHARED_LIBS=OFF \
-DGGML_OPENCL=ON
ninja
Windows Arm64
mkdir -p ~/dev/llm
cd ~/dev/llm
# Install OpenCL headers
git clone https://github.com/KhronosGroup/OpenCL-Headers && cd OpenCL-Headers
mkdir build && cd build
cmake .. -G Ninja \
-DBUILD_TESTING=OFF \
-DOPENCL_HEADERS_BUILD_TESTING=OFF \
-DOPENCL_HEADERS_BUILD_CXX_TESTS=OFF \
-DCMAKE_INSTALL_PREFIX="$HOME/dev/llm/opencl"
cmake --build . --target install
# Install OpenCL ICD Loader
cd ~/dev/llm
git clone https://github.com/KhronosGroup/OpenCL-ICD-Loader && cd OpenCL-ICD-Loader
mkdir build && cd build
cmake .. -G Ninja \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_PREFIX_PATH="$HOME/dev/llm/opencl" \
-DCMAKE_INSTALL_PREFIX="$HOME/dev/llm/opencl"
cmake --build . --target install
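# Clone llama.cpp, then enable OpenCL and compile
cd ~/dev/llm
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
mkdir build && cd build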
cmake .. -G Ninja \
-DCMAKE_TOOLCHAIN_FILE="$HOME/dev/llm/llama.cpp/cmake/arm64-windows-llvm.cmake" \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_PREFIX_PATH="$HOME/dev/llm/opencl" \
-DBUILD_SHARED_LIBS=OFF \
-DGGML_OPENCL=ON
ninja
GPU Acceleration Backend Notes
Multi-Backend Compilation
In most cases, you can compile and use multiple backends simultaneously (e.g., CUDA and Vulkan):
cmake -B build -DGGML_CUDA=ON -DGGML_VULKAN=ON
At runtime, specify the backend device using the --device parameter. To view all available devices:
./build/bin/llama-cli --list-devices
Dynamic Library Loading
Backends can also be built as dynamic libraries (shared libraries) for on-demand loading at runtime. This means the same llama.cpp executable can run on machines with different GPU devices, automatically adapting to the corresponding environment. To enable this feature, use the GGML_BACKEND_DL option when building.
Completely Disable GPU
Even with the -ngl 0 option, some computations may still use the GPU. If you want to completely disable GPU acceleration, add the parameter --device none.
Verification and Execution
After compilation, all executable files (such as llama-cli, llama-server, etc.) and related library files will be stored in the following path:
build/bin/
You can verify the success of the compilation with the following command:
./build/bin/llama-cli --version
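To see at a glance which tools were built, you can list the executables in that directory. This is a minimal sketch: list_binaries is a hypothetical helper, and it assumes the programs follow the llama-* naming used throughout this article.

```shell
# list_binaries is a hypothetical helper: prints the names of executable
# files matching llama-* in the given directory (default: build/bin).
list_binaries() {
    for f in "${1:-build/bin}"/llama-*; do
        if [ -f "$f" ] && [ -x "$f" ]; then
            basename "$f"
        fi
    done
}

# Prints nothing if the directory does not exist or holds no matching files.
list_binaries build/bin
```

An empty result usually means the build did not finish or was run from a different directory.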
Advanced Tips
- Performance Tuning: When running a model, you can use the -t parameter to specify the number of CPU threads, and -ngl (GPU Layers) to specify how many model layers to load into VRAM.
- Memory Suggestion: If you encounter Out-of-Memory (OOM) errors when compiling the GPU version, try reducing the -j thread count during cmake --build (e.g., change to -j 2 or -j 1).
- Choosing the right backend for your hardware at build time and runtime can significantly improve inference speed and efficiency. For beginners, it is recommended to try a single backend first, confirm the environment configuration is correct, and then try advanced features like multi-backend or dynamic library loading.
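The advice on lowering -j after an out-of-memory failure can be automated by retrying with fewer jobs. This is a rough sketch: build_with_backoff is a hypothetical helper that appends -j N to whatever command you give it and halves N after each failure. It cannot tell an out-of-memory failure from a genuine compile error, so treat it as a convenience rather than a diagnosis.

```shell
# build_with_backoff is a hypothetical helper, not part of llama.cpp.
# Usage: build_with_backoff START_JOBS COMMAND [ARGS...]
# Runs "COMMAND ARGS... -j N", halving N after each failure until the
# command succeeds or has already failed with N = 1.
build_with_backoff() {
    n=$1
    shift
    while :; do
        if "$@" -j "$n"; then
            echo "built with -j $n"
            return 0
        fi
        if [ "$n" -le 1 ]; then
            return 1
        fi
        n=$((n / 2))
    done
}

# Example (commented out so the snippet has no side effects):
# build_with_backoff 8 cmake --build build --config Release
```

Because an interrupted CMake build keeps the object files it already produced, each retry resumes from where the previous attempt stopped.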
Now you can start downloading GGUF format model files and experience the power of local AI!