Qwen 3.6-27B language model on RTX 2080 Ti 22GiB

Posted on 2026-05-31 • 1066 words • 6 minute read

Tags: Knowledge base, CachyOS, Linux, Qwen, AI, Language model

Qwen 3.6-27B is a state-of-the-art large language model suitable for local deployment. With technologies such as quantization, MTP, and TurboQuant, this language model can be deployed and run smoothly on an RTX 2080 Ti with 22GiB VRAM.

The following steps have been successfully tested on the latest version of CachyOS (installed using the 260426 ISO).

Installing Dependencies

sudo pacman -Syu cmake cuda nodejs npm
paru -Sy python-modelscope

Configure npm mirror if necessary:

npm config set registry https://mirrors.cloud.tencent.com/npm/
npm config set strict-ssl false

Rebooting or re-logging in is recommended for refreshing environment variables.

Compiling `llama-cpp-turboquant`

cd
git clone https://github.com/TheTom/llama-cpp-turboquant.git
cd llama-cpp-turboquant

# Updated 2026-06-17: The latest version of llama-cpp-turboquant (9901 35ac80d)
# fixed the performance issue on decoding with a slightly larger GPU memory footprint
# (see https://github.com/TheTom/llama-cpp-turboquant/issues/177).
# The context length may need to be adjusted.
#
# Updated 2026-06-11: The latest version of llama-cpp-turboquant (9450 73eb521)
# has some performance issues on decoding.
# It is recommended to use version 9438 ab11a71, which has slightly improved
# prefilling performance compared to version 9418 2cbfdc6 in the performance test.
# git checkout ab11a71

# Following three lines for compiling Web UI frontend
cd tools/ui
npm i
npm run build

cd ../..
export CUDACXX="/opt/cuda/bin/nvcc"
cmake -B build -DGGML_CUDA=ON -DLLAMA_BUILD_WEBUI=ON && cmake --build build --config Release -j --target llama-server llama-cli

Executables are in build/bin directory.

Downloading the Model

modelscope download --model Tariel/Qwen3.6-27B-4bpw-MTP.gguf --local_dir ~/Qwen3.6

Deploying

cd ~/llama-cpp-turboquant/build/bin
./llama-server -m ~/Qwen3.6/Qwen3.6-27B-4bpw-MTP.gguf \
    -mm ~/Qwen3.6/mmproj-Qwen-Qwen3.6-27B-Q6_K.gguf \
    -ngl all -fa on --spec-type draft-mtp --spec-draft-n-max 3 -np 1 -kvu \
    --ctx-checkpoints 64 --threads-http 2 \
    -ctk q8_0 -ctv turbo3 \
    -c 190208 \
    --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0 \
    --host 0.0.0.0 --port 22345 \
    --api-key "sk-jintianshifengkuangxingqisivivo50" \
    -a qwen3.6-27b \
    --jinja --chat-template-file ~/Qwen3.6/chat_template.jinja \
    --reasoning on --reasoning-format deepseek

Please modify API key followed by --api-key. If authentication is not needed, just delete --api-key argument.

Usage

Visit port 22345 (e.g., http://127.0.0.1:22345) in the Web browser for llama.cpp Web UI.

Allow port 22345 in the ufw firewall in CachyOS for visiting from other machines in the network:

sudo ufw allow 22345

llama-server supports OpenAI and Anthropic APIs, allowing user to use many AI clients with it.

Turning on/off Thinking with the Chat Template

Chat template chat_template.jinja in the model repo (modified from BoFan-tunning/llama.cpp-MTP-TurboQuant, removed auto think switch for stable context cache reuse in opencode) provides on-the-fly thinking mode switching by adding <|think_on|> or <|think_off|> in the input stream to turn on/off thinking.

Notes on Cherry Studio

To turn on/off thinking mode properly in Cherry Studio, use “nvidia” provider and config API Host to llama.cpp address (see Issue #14981).

Configuring OpenCode

Install opencode and llama-swap (for creating thinking and non-thinking variants of the model). rtk is recommended for reducing token consumption.

paru -Sy opencode-bin llama-swap-bin rtk
rtk init -g --opencode	# Initialize rtk for opencode

Create llama-swap-config.yaml:

captureBuffer: 20
performance:
  every: 10s
startPort: 60001
sendLoadingState: true

macros:
  "llama-turboquant": >
    /home/user/llama-cpp-turboquant-bin/llama-server --port ${PORT}

  "qwen_dir": "/home/user/Qwen3.6"

apiKeys:
  - "sk-jintianshifengkuangxingqisivivo50"

models:
  "qwen3.6-27b":
    cmd: |
      ${llama-turboquant}
      -m ${qwen_dir}/Qwen3.6-27B-4bpw-MTP.gguf
      -mm ${qwen_dir}/mmproj-Qwen-Qwen3.6-27B-Q6_K.gguf
      -ngl all -fa on --spec-type draft-mtp --spec-draft-n-max 3 -np 1 -kvu
      --ctx-checkpoints 64 --threads-http 2
      -ctk q8_0 -ctv turbo3
      -c 190208
      --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0
      --jinja --chat-template-file ${qwen_dir}/chat_template.jinja
      --reasoning on --reasoning-format deepseek

    name: "Qwen 3.6 27B"
    ttl: 0
    filters:
      setParamsByID:
        "${MODEL_ID}:thinking":
          chat_template_kwargs:
            enable_thinking: true
        "${MODEL_ID}:instruct":
          chat_template_kwargs:
            enable_thinking: false
    concurrencyLimit: 2
    timeouts:
      connect: 30
      keepalive: 30
      responseHeader: 60
      tlsHandshake: 10
      idleConn: 90

hooks:
  on_startup:
    preload:
      - "qwen3.6-27b"

Replace the paths of "llama-turboquant" and "qwen_dir" in macros:, as well as apiKeys.

Modify ~/.config/opencode/opencode.json:

{
  "$schema": "https://opencode.ai/config.json",
  "model": "llama.cpp/qwen3.6-27b:thinking",
  "small_model": "llama.cpp/qwen3.6-27b:instruct",
  "provider": {
    "llama.cpp": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "llama.cpp (Local)",
      "options": {
        "baseURL": "http://localhost:22345/v1",
        "apiKey": "sk-jintianshifengkuangxingqisivivo50"
      },
      "models": {
        "qwen3.6-27b:thinking": {
          "name": "Qwen3.6 27B (thinking mode)",
          "modalities": {
            "input": ["text", "image"],
            "output": ["text"]
          },
          "limit": {
            "context": 190208,
            "output": 65536
          },
          "options": {
            "max_tokens": 65536
          }
        },
        "qwen3.6-27b:instruct": {
          "name": "Qwen3.6 27B (non-thinking mode)",
          "modalities": {
            "input": ["text", "image"],
            "output": ["text"]
          },
          "limit": {
            "context": 190208,
            "output": 65536
          },
          "options": {
            "max_tokens": 65536
          }
        }
      }
    }
  }
}

Note the value of "apiKey" should be equal to the key in apiKeys: of llama-swap-config.yaml.

Deploy the model with llama-swap (don’t forget to open port 22345 in the firewall):

llama-swap --config llama-swap-config.yaml --listen 0.0.0.0:22345

Now opencode is configured. Lightweight tasks like title generation use the non-thinking variant of the model.

Performance

The environment for performance testing is: CPU i5-12500H (using integrated graphics for video output), GPU RTX 2080 Ti 22GiB, Kernel Linux 7.0.10-1-cachyos, llama-cpp-turboquant version 9418 2cbfdc6.

MTP significantly improves the decoding performance and overall performance of the model at the cost of slightly increased VRAM usage and a small reduction in prefill speed.

Input tokens	Output tokens (without MTP)	Prefill time/speed (w/o MTP)	Decode time/speed (w/o MTP)	Overall time (w/o MTP)	Output tokens (with MTP)	Prefill time/speed (w MTP)	Decode time/speed (w MTP)	Overall time (w MTP)

30667	1666	52.9 s, 1.72 ms/token, 580.16 token/s	116.6 s, 70.02 ms/token, 14.28 token/s	169.5 s	1556	64.4 s, 2.10 ms/token, 476.14 token/s (speed -17.9%)	53.3 s, 34.22 ms/token, 29.22 token/s (speed +104.6%)	117.7 s
159283	1679	500.8 s, 3.14 ms/token, 318.04 token/s	381.4 s, 227.17 ms/token, 4.40 token/s	882.3 s	1337	579.6 s, 3.64 ms/token, 274.80 token/s (speed -13.6%)	87.4 s, 65.37 ms/token, 15.30 token/s (speed +247.7%)	667.0 s

Performance on non-thinking mode:

Input tokens	Output tokens	VRAM (MiB)	Prefill speed (token/s)	Decode speed (token/s)	Task
0	/	20270	/	/	/
41	1632	29296	158.78	41.15	Problem solving
75	1260	20298	204.50	39.02	Problem solving
15202	15422	20526	513.91	38.69	Translation
30667	1556	20538	476.14	29.22	Text summary
53108	2058	20731	419.84	25.17	Text summary
83743	1612	20954	364.67	20.50	Text summary
106196	1486	21126	333.72	19.39	Text summary
136828	1721	21370	296.02	16.11	Text summary
159283	1337	21542	274.80	15.30	Text summary
189915	293	21770	249.99	14.50	Text summary (output truncated)

Performance on thinking mode:

Input tokens	Output tokens	VRAM (MiB)	Prefill speed (token/s)	Decode speed (token/s)	Task
0	/	20270	/	/	/
39	7887	20344	131.31	38.94	Problem solving
73	10436	20370	207.27	36.13	Problem solving
15200	18345	20550	518.64	35.71	Translation
30665	3043	20550	478.33	28.10	Text summary
53106	3150	20726	422.29	24.45	Text summary
83741	3715	20970	364.77	21.54	Text summary
106194	3330	21142	333.43	19.65	Text summary
136826	3860	21386	297.03	16.79	Text summary
159281	3039	21554	273.92	15.35	Text summary
189913	294	21770	249.73	15.70	Text summary (output truncated)