Qwen 3.6-27B language model on RTX 2080 Ti 22GiB

Qwen 3.6-27B is a state-of-the-art large language model suitable for local deployment. With technologies such as quantization, MTP, and TurboQuant, this language model can be deployed and run smoothly on an RTX 2080 Ti with 22GiB VRAM.

The following steps have been successfully tested on the latest version of CachyOS (installed using the 260426 ISO).

Installing dependencies

sudo pacman -Syu cmake cuda nodejs npm
paru -S python-modelscope

Configure npm mirror if necessary:

npm config set registry https://mirrors.cloud.tencent.com/npm/
npm config set strict-ssl false

Rebooting or relogging in is recommended for refreshing environment variables.

Compiling llama-cpp-turboquant

cd
git clone https://github.com/TheTom/llama-cpp-turboquant.git
# Following three lines for compiling Web UI frontend
cd llama-cpp-turboquant/tools/ui
npm i
npm run build

cd ../..
cmake -B build -DGGML_CUDA=ON -DLLAMA_BUILD_WEBUI=ON && cmake --build build --config Release -j --target llama-server llama-cli

Executables are in build/bin directory.

Downloading the model

modelscope download --model Tariel/Qwen3.6-27B-4bpw-MTP.gguf --local_dir ~/Qwen3.6

Deploying

cd ~/llama-cpp-turboquant/build/bin
./llama-server -m ~/Qwen3.6/Qwen3.6-27B-4bpw-MTP.gguf \
    -mm ~/Qwen3.6/mmproj-Qwen-Qwen3.6-27B-Q6_K.gguf \
    -fa on --spec-type draft-mtp --spec-draft-n-max 3 -np 1 -kvu \
    -c 190208 \
    -ngl all -ctk q8_0 -ctv turbo3 \
    --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0 \
    --host 0.0.0.0 \
    --port 22345 \
    -a qwen3.6:27b

Usage

Visit port 22345 (e.g., http://127.0.0.1:22345) in the Web browser for llama.cpp Web UI.

Allow port 22345 in the ufw firewall in CachyOS for visiting from other machines in the network:

sudo ufw allow 22345

llama-server supports OpenAI and Anthropic APIs, allowing user to use many AI clients with it.

Notes on Cherry Studio

To turn on/off thinking mode properly in Cherry Studio, use “nvidia” provider and config API Host to llama.cpp address (see Issue #14981).

Performance

The environment for performance testing is: CPU i5-12500H (using integrated graphics for video output), GPU RTX 2080 Ti 22GiB, Kernel Linux 7.0.10-1-cachyos.

MTP significantly improves the decoding performance and overall performance of the model at the cost of slightly increased VRAM usage and a small reduction in prefill speed.

Input tokens Output tokens (without MTP) Prefill time/speed (w/o MTP) Decode time/speed (w/o MTP) Overall time (w/o MTP) Output tokens (with MTP) Prefill time/speed (w MTP) Decode time/speed (w MTP) Overall time (w MTP)
30667 1666 52.9 s, 1.72 ms/token, 580.16 token/s 116.6 s, 70.02 ms/token, 14.28 token/s 169.5 s 1556 64.4 s, 2.10 ms/token, 476.14 token/s (speed -17.9%) 53.3 s, 34.22 ms/token, 29.22 token/s (speed +104.6%) 117.7 s
159283 1679 500.8 s, 3.14 ms/token, 318.04 token/s 381.4 s, 227.17 ms/token, 4.40 token/s 882.3 s 1337 579.6 s, 3.64 ms/token, 274.80 token/s (speed -13.6%) 87.4 s, 65.37 ms/token, 15.30 token/s (speed +247.7%) 667.0 s

Performance on non-thinking mode:

Input tokens Output tokens VRAM (MiB) Prefill speed (token/s) Decode speed (token/s) Task
0 / 20270 / / /
41 1632 29296 158.78 41.15 Problem solving
75 1260 20298 204.50 39.02 Problem solving
15202 15422 20526 513.91 38.69 Translation
30667 1556 20538 476.14 29.22 Text summary
53108 2058 20731 419.84 25.17 Text summary
83743 1612 20954 364.67 20.50 Text summary
106196 1486 21126 333.72 19.39 Text summary
136828 1721 21370 296.02 16.11 Text summary
159283 1337 21542 274.80 15.30 Text summary
189915 293 21770 249.99 14.50 Text summary (output truncated)

Performance on thinking mode:

Input tokens Output tokens VRAM (MiB) Prefill speed (token/s) Decode speed (token/s) Task
0 / 20270 / / /
39 7887 20344 131.31 38.94 Problem solving
73 10436 20370 207.27 36.13 Problem solving
15200 18345 20550 518.64 35.71 Translation
30665 3043 20550 478.33 28.10 Text summary
53106 3150 20726 422.29 24.45 Text summary
83741 3715 20970 364.77 21.54 Text summary
106194 3330 21142 333.43 19.65 Text summary
136826 3860 21386 297.03 16.79 Text summary
159281 3039 21554 273.92 15.35 Text summary
189913 294 21770 249.73 15.70 Text summary (output truncated)