Qwen 3.6-27B language model on RTX 2080 Ti 22GiB
Qwen 3.6-27B is a state-of-the-art large language model suitable for local deployment. With technologies such as quantization, MTP, and TurboQuant, this language model can be deployed and run smoothly on an RTX 2080 Ti with 22GiB VRAM.
The following steps have been successfully tested on the latest version of CachyOS (installed using the 260426 ISO).
Installing dependencies
sudo pacman -Syu cmake cuda nodejs npm
paru -S python-modelscope
Configure npm mirror if necessary:
npm config set registry https://mirrors.cloud.tencent.com/npm/
npm config set strict-ssl false
Rebooting or relogging in is recommended for refreshing environment variables.
Compiling llama-cpp-turboquant
cd
git clone https://github.com/TheTom/llama-cpp-turboquant.git
# Following three lines for compiling Web UI frontend
cd llama-cpp-turboquant/tools/ui
npm i
npm run build
cd ../..
cmake -B build -DGGML_CUDA=ON -DLLAMA_BUILD_WEBUI=ON && cmake --build build --config Release -j --target llama-server llama-cli
Executables are in build/bin directory.
Downloading the model
modelscope download --model Tariel/Qwen3.6-27B-4bpw-MTP.gguf --local_dir ~/Qwen3.6
Deploying
cd ~/llama-cpp-turboquant/build/bin
./llama-server -m ~/Qwen3.6/Qwen3.6-27B-4bpw-MTP.gguf \
-mm ~/Qwen3.6/mmproj-Qwen-Qwen3.6-27B-Q6_K.gguf \
-fa on --spec-type draft-mtp --spec-draft-n-max 3 -np 1 -kvu \
-c 190208 \
-ngl all -ctk q8_0 -ctv turbo3 \
--temp 1.0 --top-p 0.95 --top-k 20 --min-p 0 \
--host 0.0.0.0 \
--port 22345 \
-a qwen3.6:27b
Usage
Visit port 22345 (e.g., http://127.0.0.1:22345) in the Web browser for llama.cpp Web UI.
Allow port 22345 in the ufw firewall in CachyOS for visiting from other machines in the network:
sudo ufw allow 22345
llama-server supports OpenAI and Anthropic APIs, allowing user to use many AI clients with it.
Notes on Cherry Studio
To turn on/off thinking mode properly in Cherry Studio, use “nvidia” provider and config API Host to llama.cpp address (see Issue #14981).
Performance
The environment for performance testing is: CPU i5-12500H (using integrated graphics for video output), GPU RTX 2080 Ti 22GiB, Kernel Linux 7.0.10-1-cachyos.
MTP significantly improves the decoding performance and overall performance of the model at the cost of slightly increased VRAM usage and a small reduction in prefill speed.
| Input tokens | Output tokens (without MTP) | Prefill time/speed (w/o MTP) | Decode time/speed (w/o MTP) | Overall time (w/o MTP) | Output tokens (with MTP) | Prefill time/speed (w MTP) | Decode time/speed (w MTP) | Overall time (w MTP) |
|---|---|---|---|---|---|---|---|---|
| 30667 | 1666 | 52.9 s, 1.72 ms/token, 580.16 token/s | 116.6 s, 70.02 ms/token, 14.28 token/s | 169.5 s | 1556 | 64.4 s, 2.10 ms/token, 476.14 token/s (speed -17.9%) | 53.3 s, 34.22 ms/token, 29.22 token/s (speed +104.6%) | 117.7 s |
| 159283 | 1679 | 500.8 s, 3.14 ms/token, 318.04 token/s | 381.4 s, 227.17 ms/token, 4.40 token/s | 882.3 s | 1337 | 579.6 s, 3.64 ms/token, 274.80 token/s (speed -13.6%) | 87.4 s, 65.37 ms/token, 15.30 token/s (speed +247.7%) | 667.0 s |
Performance on non-thinking mode:
| Input tokens | Output tokens | VRAM (MiB) | Prefill speed (token/s) | Decode speed (token/s) | Task |
|---|---|---|---|---|---|
| 0 | / | 20270 | / | / | / |
| 41 | 1632 | 29296 | 158.78 | 41.15 | Problem solving |
| 75 | 1260 | 20298 | 204.50 | 39.02 | Problem solving |
| 15202 | 15422 | 20526 | 513.91 | 38.69 | Translation |
| 30667 | 1556 | 20538 | 476.14 | 29.22 | Text summary |
| 53108 | 2058 | 20731 | 419.84 | 25.17 | Text summary |
| 83743 | 1612 | 20954 | 364.67 | 20.50 | Text summary |
| 106196 | 1486 | 21126 | 333.72 | 19.39 | Text summary |
| 136828 | 1721 | 21370 | 296.02 | 16.11 | Text summary |
| 159283 | 1337 | 21542 | 274.80 | 15.30 | Text summary |
| 189915 | 293 | 21770 | 249.99 | 14.50 | Text summary (output truncated) |
Performance on thinking mode:
| Input tokens | Output tokens | VRAM (MiB) | Prefill speed (token/s) | Decode speed (token/s) | Task |
|---|---|---|---|---|---|
| 0 | / | 20270 | / | / | / |
| 39 | 7887 | 20344 | 131.31 | 38.94 | Problem solving |
| 73 | 10436 | 20370 | 207.27 | 36.13 | Problem solving |
| 15200 | 18345 | 20550 | 518.64 | 35.71 | Translation |
| 30665 | 3043 | 20550 | 478.33 | 28.10 | Text summary |
| 53106 | 3150 | 20726 | 422.29 | 24.45 | Text summary |
| 83741 | 3715 | 20970 | 364.77 | 21.54 | Text summary |
| 106194 | 3330 | 21142 | 333.43 | 19.65 | Text summary |
| 136826 | 3860 | 21386 | 297.03 | 16.79 | Text summary |
| 159281 | 3039 | 21554 | 273.92 | 15.35 | Text summary |
| 189913 | 294 | 21770 | 249.73 | 15.70 | Text summary (output truncated) |