Prerequisites

Before starting, ensure you have the following installed on your system:

System Dependencies

  1. Git: For cloning the repository.
    sudo apt-get install git
    
  2. CMake: For building the project.
    sudo apt-get install cmake
    
  3. Build Essentials: Required for compiling the project.
    sudo apt-get install build-essential
    
  4. CUDA Toolkit (Optional): Required only if you plan to build with CUDA GPU acceleration on an NVIDIA GPU (a quick check follows this list).
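
If you opt for the CUDA build, a quick way to confirm that the NVIDIA driver and CUDA Toolkit are visible (a sketch, assuming both are already installed) is:

nvcc --version   # prints the CUDA compiler version if the toolkit is on your PATH
nvidia-smi       # lists detected NVIDIA GPUs and the installed driver version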

Tools for Downloading the Model

  • wget or Hugging Face CLI: To download the model file from Hugging Face (a CLI install sketch follows below).
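
If you prefer the Hugging Face CLI over wget, one way to install it (a sketch, assuming Python and pip are available) is:

pip install -U "huggingface_hub[cli]"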

Step-by-Step Installation

Step 1: Clone the Repository

First, clone the llama.cpp repository to your local machine.

git clone https://github.com/ggml-org/llama.cpp.git

Step 2: Navigate to the Repository Directory

Change directory to the cloned repository.

cd llama.cpp

Step 3: Pull the Latest Changes

Ensure you have the latest version of the code.

git pull

Step 4: Choose Hardware Configuration

Select the appropriate build configuration based on your hardware.

For CPU Only

cmake -B build

For CUDA GPU

If you have CUDA installed and want to leverage GPU acceleration:

cmake -B build -DGGML_CUDA=ON

Step 5: Build the Project

Compile the project with the chosen configuration.

cmake --build build --config Release

Note: The build process may take several minutes, depending on your system’s performance.
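
On multi-core machines you can speed this up by building in parallel; a minimal sketch, assuming nproc is available (it is on most Linux distributions):

cmake --build build --config Release -j "$(nproc)"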

Step 6: Copy Binaries to the Bin Directory

After a successful build, copy the generated binaries to a directory of your choice (e.g., ~/bin).

mkdir -p ~/bin
cp build/bin/* ~/bin/
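
If ~/bin is not already on your PATH, you can add it; this sketch assumes a Bash shell, so adjust for your shell of choice:

echo 'export PATH="$HOME/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc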

Step 7: Download the DeepSeek-R1 Model

The DeepSeek-R1 model is available on Hugging Face. Use wget or the Hugging Face CLI to download it. For this tutorial, we will download the small 1.5B-parameter distilled version.

  1. First, create the model directory:
    mkdir -p ~/.local/share/llama.cpp/models
    
  2. Download the model with wget (an alternative using the Hugging Face CLI follows):
    wget https://huggingface.co/bartowski/DeepSeek-R1-Distill-Qwen-1.5B-GGUF/resolve/main/DeepSeek-R1-Distill-Qwen-1.5B-Q8_0.gguf -O ~/.local/share/llama.cpp/models/DeepSeek-R1-Distill-Qwen-1.5B-Q8_0.gguf
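
Alternatively, if you installed the Hugging Face CLI, the same file can be fetched with it; a sketch based on its standard download syntax:

huggingface-cli download bartowski/DeepSeek-R1-Distill-Qwen-1.5B-GGUF \
  DeepSeek-R1-Distill-Qwen-1.5B-Q8_0.gguf \
  --local-dir ~/.local/share/llama.cpp/models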
    

Step 8: Run the Server

  1. Navigate to your bin directory:
    cd ~/bin
    
  2. Run the server with the following command:
    ./llama-server \
      --model ~/.local/share/llama.cpp/models/DeepSeek-R1-Distill-Qwen-1.5B-Q8_0.gguf \
      --ctx-size 32000 \
      --host 127.0.0.1 \
      --port 8080 \
      --main-gpu 0 \
      --gpu-layers 90
    

Parameters:

  • --model: Path to the DeepSeek-R1 model file.
  • --ctx-size: Context window size (adjust as needed).
  • --host: Server host address (default: localhost).
  • --port: Server port (default: 8080).
  • --main-gpu: Index of the GPU to treat as the primary device when multiple GPUs are present (GPU builds only).
  • --gpu-layers: Number of model layers to offload to the GPU (adjust based on your GPU’s VRAM; ignored in CPU-only builds).

Step 9: Verify the Installation

  1. Ensure the server is running without errors.
  2. You can access the chat UI at http://localhost:8080 to chat with the model (a command-line check is shown below).
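
As an additional check, you can send a request to the server's OpenAI-compatible chat completion endpoint with curl; a minimal sketch (the prompt and token limit here are arbitrary):

curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Hello! Briefly introduce yourself."}],
        "max_tokens": 128
      }'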

Troubleshooting

  • Build Errors: Ensure all dependencies are installed and your CUDA setup is correct.
  • Permission Issues: Use sudo where necessary or adjust file permissions.
  • CUDA Not Detected: Verify CUDA is installed and properly configured on your system.
  • Model Not Found: Double-check the model path and ensure the file exists.

Conclusion

You have successfully installed llama.cpp and set up the DeepSeek-R1 model to run locally. The server is now ready to serve requests at http://localhost:8080, and you can use the built-in chat interface at that address to interact with the model.

Note: Replace placeholders like ~/bin and ~/.local/share/llama.cpp/models with your actual directory paths as needed.

Enjoy exploring the capabilities of DeepSeek-R1 with llama.cpp!