Compiling llama.cpp on Linux: Full Guide from CPU to CUDA Acceleration

A detailed guide on how to compile llama.cpp from source on Linux, covering basic CPU versions and NVIDIA GPU (CUDA) acceleration configuration steps. Includes complete compilation command reference.

For developers looking to run Large Language Models (LLMs) locally, llama.cpp is one of the most widely used open-source inference frameworks. Through heavily optimized C/C++ code and quantization techniques, it allows models that once required expensive GPUs to run on ordinary computers or even laptops.

The main product of this project is the llama library, whose C-style API can be found in the include/llama.h file. The project also includes a large number of example programs and tools, all developed based on the llama library, ranging from simple code snippets to more complex sub-projects such as an OpenAI-compatible HTTP server.

This article walks you through compiling llama.cpp from source on Linux so that the build matches your hardware, and provides a compilation command reference at the end.

🛠️ Prerequisites

Before starting the compilation, please ensure your system has the following basic tools installed:

Basic Build Tools: Git (for cloning source code), CMake (for building the project), and a C/C++ compiler such as GCC or Clang.
Drivers & Toolkit (GPU Users only): If you are using an NVIDIA GPU, you need a working NVIDIA driver and the CUDA Toolkit. The driver must be recent enough for the toolkit version you install.
Environment Preparation: It is recommended to perform these operations in a dedicated folder.
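
A quick way to confirm the prerequisites is to check that each tool resolves on your PATH. The sketch below assumes a POSIX shell; `missing_tools` is a hypothetical helper written for this article, not part of llama.cpp. It prints the name of every tool it cannot find.

```shell
# Hypothetical helper: print each listed tool that is NOT found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Tools named in the prerequisites; add nvcc to the list for CUDA builds.
missing="$(missing_tools git cmake)"
if [ -n "$missing" ]; then
    echo "Missing tools:" $missing
else
    echo "All required build tools found."
fi
```

If anything is reported missing, install it with your distribution's package manager before continuing.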

📥 Fetch the Source Code

First, clone the latest codebase from the official GitHub repository:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

🚀 Quick Compilation Guide

1. Compiling the CPU Version (Generic)

If you don’t have a dedicated GPU or only need to run models on the CPU, you can use the simplest compilation method. It is the most portable option and works on both x86 and ARM systems.

# Create build directory and configure
cmake -B build

# Execute build (It's recommended to put your CPU thread count after -j, e.g., -j 8)
cmake --build build --config Release -j 8
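
Rather than hard-coding the thread count, you can let the shell detect it. This is a minimal sketch, assuming `nproc` (GNU coreutils) or `getconf` is available; `detect_jobs` is a hypothetical helper name. It only prints the resulting build command so you can inspect it first.

```shell
# Detect the logical CPU count; fall back to 1 if neither tool is available.
detect_jobs() {
    nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1
}

njobs="$(detect_jobs)"
# Print the build command instead of running it.
echo "cmake --build build --config Release -j $njobs"
```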

2. Compiling the GPU Version (NVIDIA CUDA Acceleration) ← Default Recommended

If you have an NVIDIA GPU, enabling CUDA support is strongly recommended. It lets model layers run on the GPU out of VRAM, which is typically much faster than CPU inference; the actual speedup depends on the model, the quantization, and your hardware.

Note: Please ensure that the nvcc command is available (meaning CUDA Toolkit is correctly installed).

# Configure CUDA acceleration parameters
# GGML_CUDA=ON: Enable CUDA support
# GGML_CUDA_NCCL=ON: Enable NCCL (NVIDIA Collective Communications Library) for multi-GPU communication
# Note: GGML_CUDA_ENABLE_UNIFIED_MEMORY is a runtime environment variable, not a CMake option.
# Set it when launching a binary (see the Unified Memory section in the CUDA reference below).
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_NCCL=ON

# Execute build
cmake --build build --config Release -j 8

📚 llama.cpp Compilation Command Reference

This section summarizes the compilation methods for llama.cpp across the supported backends and platforms.

CPU Build

Basic CPU Build

Build llama.cpp using CMake:

cmake -B build
cmake --build build --config Release

Notes:

To speed up compilation, add the -j parameter to run build jobs in parallel. For example, cmake --build build --config Release -j 8 runs 8 jobs concurrently. Alternatively, use a generator that parallelizes by default (like Ninja).
It is recommended to install the ccache tool to speed up repeated compilations.

Debug Build

Single-configuration generators (like the default Unix Makefiles, which ignore the --config parameter):

cmake -B build -DCMAKE_BUILD_TYPE=Debug
cmake --build build

Multi-configuration generators (like using -G "Xcode" or Visual Studio):

cmake -B build -G "Xcode"
cmake --build build --config Debug
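
Because single-configuration generators fix the build type at configure time, it can help to check what an existing build directory was configured with. The sketch below reads the `CMAKE_BUILD_TYPE` entry from `CMakeCache.txt`; `cached_build_type` is a hypothetical helper, and the value is only meaningful for single-configuration generators.

```shell
# Hypothetical helper: print the CMAKE_BUILD_TYPE recorded in a build directory.
# Prints nothing if the directory has not been configured yet.
cached_build_type() {
    cache="$1/CMakeCache.txt"
    [ -f "$cache" ] || return 0
    # Cache entries look like: CMAKE_BUILD_TYPE:STRING=Debug
    sed -n 's/^CMAKE_BUILD_TYPE:[A-Z]*=//p' "$cache"
}

echo "Configured build type: $(cached_build_type build)"
```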

Static Library Build

For a static library version, add the parameter -DBUILD_SHARED_LIBS=OFF:

cmake -B build -DBUILD_SHARED_LIBS=OFF
cmake --build build --config Release

Windows Build

Please install Visual Studio 2022 (e.g., Community Edition) with the following features:

Workload: C++ desktop development
Components: C++ CMake tools, Git for Windows, C++ Clang compiler, MSBuild support for LLVM toolchain (clang)

On Windows on ARM (arm64, WoA):

cmake --preset arm64-windows-llvm-release -D GGML_OPENMP=OFF
cmake --build build-arm64-windows-llvm-release

To build with the Ninja generator and clang as the default compiler, set the LIB environment variable first:

set LIB=C:\Program Files (x86)\Windows Kits\10\Lib\10.0.22621.0\um\x64;C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.41.34120\lib\x64\uwp;C:\Program Files (x86)\Windows Kits\10\Lib\10.0.22621.0\ucrt\x64

cmake --preset x64-windows-llvm-release
cmake --build build-x64-windows-llvm-release

Curl (libcurl client library) is enabled by default. Disable it with -DLLAMA_CURL=OFF if not needed.

BLAS Build

Enabling BLAS (Basic Linear Algebra Subprograms) can improve prompt-processing speed when working with large batches (e.g., batch size > 32; 512 is recommended). BLAS does not affect text generation performance.

Accelerate Framework

Only available on Mac and enabled by default. Follow the standard build process.

OpenBLAS

Pure CPU BLAS acceleration, requires OpenBLAS pre-installed on your machine.

cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
cmake --build build --config Release

Intel oneMKL

Building with the oneAPI compilers makes the avx_vnni instruction set available on Intel processors that don’t support avx512.

After manually installing oneAPI:

source /opt/intel/oneapi/setvars.sh
cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=Intel10_64lp -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DGGML_NATIVE=ON
cmake --build build --config Release

GGML_BLAS_VENDOR defaults to Generic, so if you have sourced the Intel environment script and pass -DGGML_BLAS=ON, the MKL version of BLAS is selected automatically.

For other BLAS libraries, specify via GGML_BLAS_VENDOR. See CMake Documentation for supported libraries.

Metal Build

On macOS, Metal (Apple’s custom graphics and compute API) is enabled by default, allowing computation tasks to be offloaded to the GPU. To disable Metal at compile time, add -DGGML_METAL=OFF.

If Metal support is enabled (default), you can explicitly disable GPU inference with the startup parameter --n-gpu-layers 0.

SYCL (Intel GPU)

SYCL is a high-level heterogeneous computing programming model that can improve cross-device (e.g., CPU, GPU, FPGA) development efficiency.

llama.cpp’s SYCL-based implementation supports Intel GPUs (such as the Data Center Max series, Flex series, Arc series, and integrated GPUs).

For detailed instructions, see llama.cpp for SYCL.

CUDA (NVIDIA GPU) ← Default Recommended

Accelerate using NVIDIA GPU. Please ensure CUDA Toolkit is installed.

Official Download

Please visit NVIDIA Developer Website to get the CUDA installer.

Compilation Command

cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

Override CUDA Architecture (Compute Capability) Settings

If nvcc cannot detect your GPU, you may see the following warning:

nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used

In this case, you can manually specify the CUDA compute capability:

Check the Compute Capability of your NVIDIA device, for example:

GPU Model	Compute Capability
GeForce RTX 4090	8.9
GeForce RTX 3080 Ti	8.6
GeForce RTX 3070	8.6

Manually list the Compute Capabilities to support in the CMake command, separated by semicolons ;:

cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="86;89"
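
The conversion from compute capability to the CMake value is just "drop the dot and join with semicolons". As a small sketch, the hypothetical helper `cc_to_arch_list` below does that for any number of capabilities and removes duplicates. It only prints the resulting configure command.

```shell
# Hypothetical helper: turn compute capabilities (e.g. 8.6 8.9) into the
# semicolon-separated list that CMAKE_CUDA_ARCHITECTURES expects (86;89).
cc_to_arch_list() {
    printf '%s\n' "$@" | tr -d '.' | sort -u | paste -sd ';' -
}

archs="$(cc_to_arch_list 8.6 8.9 8.6)"
# Print the configure command instead of running it.
echo "cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=\"$archs\""
```

If your driver's nvidia-smi supports the compute_cap query field, `nvidia-smi --query-gpu=compute_cap --format=csv,noheader` may give you the input values directly; treat that as an assumption to verify on your system.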

CUDA Runtime Environment Variables

You can set CUDA environment variables at runtime:

# Use CUDA_VISIBLE_DEVICES to hide the first compute device
CUDA_VISIBLE_DEVICES="-0" ./build/bin/llama-server --model /srv/models/llama.gguf

Unified Memory

On Linux, set the environment variable GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 to enable CUDA Unified Memory, which means when VRAM is exhausted, it will automatically use system memory (instead of crashing). On Windows, this setting can be achieved through the “System Memory Fallback” option in the NVIDIA Control Panel.
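
Since this is a runtime environment variable, it is usually set as a prefix on the command being launched, so it affects only that process. The sketch below demonstrates the scoping with a plain `sh -c` call and wraps the pattern in a hypothetical helper, `with_unified_memory`; the llama-server invocation is shown as a comment and is not executed.

```shell
# A VAR=value prefix sets the variable for that one command only.
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 sh -c 'echo "inside: ${GGML_CUDA_ENABLE_UNIFIED_MEMORY:-unset}"'

# Hypothetical wrapper: run any command with unified memory enabled.
with_unified_memory() {
    GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 "$@"
}

# Usage (not executed here):
#   with_unified_memory ./build/bin/llama-server --model /srv/models/llama.gguf
```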

Performance Tuning Options

Option	Values	Default	Description
`GGML_CUDA_FORCE_MMQ`	Boolean	`false`	Force the use of custom matrix multiplication kernels for quantized models instead of FP16 cuBLAS, even without int8 tensor cores (affects V100, CDNA, and RDNA3+). These kernels are enabled by default on GPUs with int8 tensor cores. Forcing them reduces speed at large batch sizes but lowers VRAM usage.
`GGML_CUDA_FORCE_CUBLAS`	Boolean	`false`	Force the use of FP16 cuBLAS instead of custom matrix multiplication kernels for quantized models.
`GGML_CUDA_F16`	Boolean	`false`	Use half-precision floating-point (FP16) arithmetic for the CUDA dequantization and multiplication kernels and for the q4_1/q5_1 matrix multiplication kernels. Can improve performance on newer GPUs.
`GGML_CUDA_PEER_MAX_BATCH_SIZE`	Positive Integer	`128`	Maximum batch size for which peer access (direct VRAM access between GPUs) is enabled in multi-GPU setups. Peer access requires either Linux or NVLink.
`GGML_CUDA_FA_ALL_QUANTS`	Boolean	`false`	Compile all KV cache quantization types for the FlashAttention CUDA kernels. Allows finer-grained control over the KV cache but takes longer to compile.
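
These options are passed at configure time in the same way as GGML_CUDA, as -D arguments to cmake. The sketch below uses a hypothetical helper, `cmake_defs`, to format NAME=VALUE pairs; the combination shown is only an example, not a recommended configuration.

```shell
# Hypothetical helper: format NAME=VALUE pairs as space-separated -D arguments.
cmake_defs() {
    sep=''
    for pair in "$@"; do
        printf '%s-D%s' "$sep" "$pair"
        sep=' '
    done
}

# Example: CUDA build that compiles all FlashAttention KV cache quantization types.
# Print the configure command instead of running it.
echo "cmake -B build $(cmake_defs GGML_CUDA=ON GGML_CUDA_FA_ALL_QUANTS=ON)"
```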

MUSA (Moore Threads GPU)

MUSA provides acceleration support for Moore Threads GPUs. Please ensure MUSA SDK is installed.

Official Download

Visit Moore Threads Developer Center to download the MUSA SDK.

Compilation Command

cmake -B build -DGGML_MUSA=ON
cmake --build build --config Release

Override Compute Capability

All supported MUSA compute capabilities are enabled by default. To customize:

cmake -B build -DGGML_MUSA=ON -DMUSA_ARCHITECTURES="21"
cmake --build build --config Release

This configuration only enables devices with compute capability 2.1 (MTT S80), which can effectively reduce compilation time.

Static Library Compilation

cmake -B build -DGGML_MUSA=ON \
  -DBUILD_SHARED_LIBS=OFF -DCMAKE_POSITION_INDEPENDENT_CODE=ON
cmake --build build --config Release

Runtime Environment Variables

MUSA_VISIBLE_DEVICES="-0" ./build/bin/llama-server --model /srv/models/llama.gguf

Unified Memory

Set the environment variable GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 to enable unified memory on Linux systems.

HIP (AMD GPU / ROCm)

HIP provides GPU acceleration on AMD GPUs supported by ROCm.

Please ensure ROCm is installed.

Linux Compilation (using AMD GPU with gfx1030 architecture as example)

HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
    cmake -S . -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1030 -DCMAKE_BUILD_TYPE=Release \
    && cmake --build build --config Release -- -j 16

If your hardware is RDNA3+ or CDNA architecture, add -DGGML_HIP_ROCWMMA_FATTN=ON to improve Flash Attention performance. Ensure rocWMMA library is installed.

Fix ROCm device library error:

If you see this error during compilation:

clang: error: cannot find ROCm device library; provide its path via '--rocm-path' or '--rocm-device-lib-path', or pass '-nogpulib' to build without ROCm device library

Find the directory containing oclc_abi_version_400.bc under HIP_PATH and pass it via HIP_DEVICE_LIB_PATH:

HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -p)" \
HIP_DEVICE_LIB_PATH=<your directory path> \
    cmake -S . -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1030 -DCMAKE_BUILD_TYPE=Release \
    && cmake --build build -- -j 16

Windows Compilation

set PATH=%HIP_PATH%\bin;%PATH%
cmake -S . -B build -G Ninja -DAMDGPU_TARGETS=gfx1100 -DGGML_HIP=ON -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_BUILD_TYPE=Release
cmake --build build

Make sure AMDGPU_TARGETS is set to the GPU architecture you want to support. For example, gfx1100 applies to Radeon RX 7900XTX/XT/GRE.

Query your GPU model:

rocminfo | grep gfx | head -1 | awk '{print $2}'
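
To feed the detected architecture straight into the configure step, the same pipeline can be wrapped in a small function. This is a sketch only: `amdgpu_target_arg` is a hypothetical helper, and the sample input line is a simplified illustration rather than real rocminfo output.

```shell
# Hypothetical helper: read rocminfo-style text on stdin and print the first
# gfx target as a CMake argument, using the same grep/head/awk pipeline.
amdgpu_target_arg() {
    target="$(grep gfx | head -1 | awk '{print $2}')"
    if [ -n "$target" ]; then
        printf -- '-DAMDGPU_TARGETS=%s\n' "$target"
    fi
}

# Real use (requires ROCm):  rocminfo | amdgpu_target_arg
# Illustrative input only; actual rocminfo output contains many more fields.
printf '  Name:                    gfx1030\n' | amdgpu_target_arg
```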

Unified Memory

On Linux systems, you can enable Unified Memory Architecture (UMA) by setting the environment variable GGML_CUDA_ENABLE_UNIFIED_MEMORY=1.

Vulkan

Windows Platform - w64devkit

Download and extract w64devkit
Download and install Vulkan SDK with default configuration
Open w64devkit.exe and run the following commands to copy Vulkan dependency files:

SDK_VERSION=1.3.283.0
cp /VulkanSDK/$SDK_VERSION/Bin/glslc.exe $W64DEVKIT_HOME/bin/
cp /VulkanSDK/$SDK_VERSION/Lib/vulkan-1.lib $W64DEVKIT_HOME/x86_64-w64-mingw32/lib/
cp -r /VulkanSDK/$SDK_VERSION/Include/* $W64DEVKIT_HOME/x86_64-w64-mingw32/include/
cat > $W64DEVKIT_HOME/x86_64-w64-mingw32/lib/pkgconfig/vulkan.pc <<EOF
Name: Vulkan-Loader
Description: Vulkan Loader
Version: $SDK_VERSION
Libs: -lvulkan-1
EOF

Switch to llama.cpp project directory and build with CMake:

cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release

Git Bash MINGW64

cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release

Load model using Vulkan backend:

build/bin/Release/llama-cli -m "[model path]" -ngl 100 -c 16384 -t 10 -n -2 -cnv

MSYS2

pacman -S git \
    mingw-w64-ucrt-x86_64-gcc \
    mingw-w64-ucrt-x86_64-cmake \
    mingw-w64-ucrt-x86_64-vulkan-devel \
    mingw-w64-ucrt-x86_64-shaderc

cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release

Using Docker

# Build image
docker build -t llama-cpp-vulkan --target light -f .devops/vulkan.Dockerfile .

# Run
docker run -it --rm -v "$(pwd):/app:Z" --device /dev/dri/renderD128:/dev/dri/renderD128 --device /dev/dri/card1:/dev/dri/card1 llama-cpp-vulkan -m "/app/models/YOUR_MODEL_FILE" -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33

Non-Docker Environment (Ubuntu 22.04)

wget -qO - https://packages.lunarg.com/lunarg-signing-key-pub.asc | apt-key add -
wget -qO /etc/apt/sources.list.d/lunarg-vulkan-jammy.list https://packages.lunarg.com/vulkan/lunarg-vulkan-jammy.list
apt update -y
apt-get install -y vulkan-sdk
vulkaninfo

cmake -B build -DGGML_VULKAN=1
cmake --build build --config Release

# Test command
./build/bin/llama-cli -m "PATH_TO_MODEL" -p "Hi you how are you" -n 50 -e -ngl 33 -t 4

CANN (Huawei Ascend NPU)

CANN supports acceleration using Huawei Ascend NPU AI cores.

Please ensure CANN Toolkit is installed. Download from CANN Toolkit.

cmake -B build -DGGML_CANN=on -DCMAKE_BUILD_TYPE=release
cmake --build build --config release

# Test
./build/bin/llama-cli -m PATH_TO_MODEL -p "Building a website can be done in 10 steps:" -ngl 32

If the screen outputs the following information, it means CANN backend is successfully used:

llm_load_tensors:       CANN model buffer size = 13313.00 MiB
llama_new_context_with_model:       CANN compute buffer size =  1260.81 MiB
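
If you would rather not scan the startup log by eye, a grep over the output can do the check. The sketch below assumes the log lines keep the "<backend> ... buffer size" shape quoted above, which may change between llama.cpp versions; `log_uses_backend` is a hypothetical helper.

```shell
# Hypothetical helper: succeed if log text on stdin mentions a buffer for the
# given backend name (e.g. CANN), as in the sample output quoted above.
log_uses_backend() {
    grep -q "$1 .*buffer size"
}

# Real use (not executed here; logs are assumed to go to stderr, hence 2>&1):
#   ./build/bin/llama-cli -m PATH_TO_MODEL -p "..." -ngl 32 2>&1 | log_uses_backend CANN
# Here the helper is fed one of the lines quoted above.
if printf 'llm_load_tensors:       CANN model buffer size = 13313.00 MiB\n' | log_uses_backend CANN; then
    echo "CANN backend in use"
fi
```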

For more details on model/device support and CANN installation, please refer to llama.cpp for CANN.

Arm KleidiAI

KleidiAI is a library of AI microkernels optimized for Arm CPUs. Enabling it can improve AI workload performance on Arm platforms.

cmake -B build -DGGML_CPU_KLEIDIAI=ON
cmake --build build --config Release

# Verify
./build/bin/llama-cli -m PATH_TO_MODEL -p "What is a car?"

If KleidiAI is enabled, the output will be similar to:

load_tensors: CPU_KLEIDIAI model buffer size = 3474.00 MiB

If the platform supports SME, you need to manually set the environment variable GGML_KLEIDIAI_SME=1 to enable the corresponding functionality.

Note: On some compilation targets, other higher-priority backends may be enabled by default. To force use of CPU backend, disable high-priority backends at compile time (e.g., -DGGML_METAL=OFF), or use --device none at runtime.

OpenCL

The OpenCL backend can achieve GPU acceleration through modern Adreno GPUs (Qualcomm chips).

Android Platform

Assuming NDK path is set to $ANDROID_NDK:

mkdir -p ~/dev/llm
cd ~/dev/llm

# Install OpenCL headers and ICD loader
git clone https://github.com/KhronosGroup/OpenCL-Headers && \
cd OpenCL-Headers && \
cp -r CL $ANDROID_NDK/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/include

cd ~/dev/llm

git clone https://github.com/KhronosGroup/OpenCL-ICD-Loader && \
cd OpenCL-ICD-Loader && \
mkdir build_ndk && cd build_ndk && \
cmake .. -G Ninja -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
  -DOPENCL_ICD_LOADER_HEADERS_DIR=$ANDROID_NDK/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/include \
  -DANDROID_ABI=arm64-v8a \
  -DANDROID_PLATFORM=24 \
  -DANDROID_STL=c++_shared && \
ninja && \
cp libOpenCL.so $ANDROID_NDK/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/lib/aarch64-linux-android

# Enable OpenCL build for llama.cpp
cd ~/dev/llm
git clone https://github.com/ggml-org/llama.cpp && \
cd llama.cpp && \
mkdir build-android && cd build-android

cmake .. -G Ninja \
  -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
  -DANDROID_ABI=arm64-v8a \
  -DANDROID_PLATFORM=android-28 \
  -DBUILD_SHARED_LIBS=OFF \
  -DGGML_OPENCL=ON

ninja

Windows Arm64

mkdir -p ~/dev/llm
cd ~/dev/llm

# Install OpenCL headers
git clone https://github.com/KhronosGroup/OpenCL-Headers && cd OpenCL-Headers
mkdir build && cd build
cmake .. -G Ninja \
  -DBUILD_TESTING=OFF \
  -DOPENCL_HEADERS_BUILD_TESTING=OFF \
  -DOPENCL_HEADERS_BUILD_CXX_TESTS=OFF \
  -DCMAKE_INSTALL_PREFIX="$HOME/dev/llm/opencl"
cmake --build . --target install

# Install OpenCL ICD Loader
cd ~/dev/llm
git clone https://github.com/KhronosGroup/OpenCL-ICD-Loader && cd OpenCL-ICD-Loader
mkdir build && cd build
cmake .. -G Ninja \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_PREFIX_PATH="$HOME/dev/llm/opencl" \
  -DCMAKE_INSTALL_PREFIX="$HOME/dev/llm/opencl"
cmake --build . --target install

# Clone llama.cpp, enable OpenCL, and compile
cd ~/dev/llm
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
mkdir build && cd build
cmake .. -G Ninja \
  -DCMAKE_TOOLCHAIN_FILE="$HOME/dev/llm/llama.cpp/cmake/arm64-windows-llvm.cmake" \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_PREFIX_PATH="$HOME/dev/llm/opencl" \
  -DBUILD_SHARED_LIBS=OFF \
  -DGGML_OPENCL=ON
ninja

⚙️ GPU Acceleration Backend Notes

Multi-Backend Compilation

In most cases, you can compile and use multiple backends simultaneously (e.g., CUDA and Vulkan):

cmake -B build -DGGML_CUDA=ON -DGGML_VULKAN=ON

At runtime, specify the backend device using the --device parameter. To view all available devices:

./build/bin/llama-cli --list-devices

Dynamic Library Loading

Backends can also be built as dynamic libraries (shared libraries) for on-demand loading at runtime. This means the same llama.cpp executable can run on machines with different GPU devices, automatically adapting to the corresponding environment. To enable this feature, use the GGML_BACKEND_DL option when building.

Completely Disable GPU

Even with the -ngl 0 option, some computations may still use the GPU. If you want to completely disable GPU acceleration, add the parameter --device none.

📁 Verification and Execution

After compilation, all executable files (such as llama-cli, llama-server, etc.) and related library files will be stored in the following path:

build/bin/

You can verify the success of the compilation with the following command:

./build/bin/llama-cli --version
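
To check that the binaries you expect were actually produced, you can test for them directly. This is a minimal sketch; `missing_binaries` is a hypothetical helper, and the list of names is an example to adjust to the tools you need.

```shell
# Hypothetical helper: print each expected binary that is missing or not
# executable under the given directory.
missing_binaries() {
    dir="$1"; shift
    for name in "$@"; do
        [ -x "$dir/$name" ] || printf '%s\n' "$name"
    done
}

missing="$(missing_binaries build/bin llama-cli llama-server)"
if [ -z "$missing" ]; then
    echo "Build looks complete."
else
    echo "Not found in build/bin:" $missing
fi
```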

💡 Advanced Tips

Performance Tuning: When running a model, you can use the -t parameter to specify the number of CPU threads, and -ngl (GPU Layers) to specify how many model layers to load into VRAM.
Memory Suggestion: If you encounter Out-of-Memory (OOM) errors when compiling the GPU version, try reducing the -j thread count during cmake --build (e.g., change to -j 2 or -j 1).
Choosing the right backend for your hardware at build time and runtime can significantly improve inference speed and efficiency. For beginners, it is recommended to try a single backend first, confirm the environment configuration is correct, and then try advanced features like multi-backend or dynamic library loading.
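
For the out-of-memory tip above, one rough approach is to cap the job count by available memory as well as by core count. The sketch below is illustrative only: `safe_jobs` is a hypothetical helper, and the figure of about 2 GiB per compile job is an assumption, not a measured value for llama.cpp.

```shell
# Hypothetical helper: cap parallel jobs so that jobs * per_job_gib fits in
# the available memory. Arguments: cpu_count mem_gib per_job_gib (integers).
safe_jobs() {
    cpus="$1"; mem="$2"; per_job="$3"
    by_mem=$(( mem / per_job ))
    if [ "$by_mem" -lt 1 ]; then by_mem=1; fi
    if [ "$by_mem" -lt "$cpus" ]; then echo "$by_mem"; else echo "$cpus"; fi
}

# Example: 16 cores, 8 GiB free, assuming roughly 2 GiB per compile job.
# Print the build command instead of running it.
echo "cmake --build build --config Release -j $(safe_jobs 16 8 2)"
```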

Now you can start downloading GGUF format model files and experience the power of local AI!