使用 RTX 2080 Ti 22GiB 部署 Qwen 3.6-27B 语言模型

Posted on 2026-05-31 • 1847 words • 4 minute read

Tags: 知识库, CachyOS, Linux, Qwen, 语言模型

Qwen 3.6-27B 是性能强劲、适合本地部署的语言模型。借助量化、MTP 和 TurboQuant 等技术，该模型可以在 RTX 2080 Ti 22GiB 上部署并流畅运行。

以下步骤在运行 CachyOS 最新版（使用 260426 ISO 安装）的电脑上测试通过。

安装依赖

sudo pacman -Syu cmake cuda nodejs npm
paru -Sy python-modelscope

必要时配置 npm 镜像：

npm config set registry https://mirrors.cloud.tencent.com/npm/
npm config set strict-ssl false

安装后建议重启或重新登录，以刷新环境变量。

编译 `llama-cpp-turboquant`

cd
git clone https://github.com/TheTom/llama-cpp-turboquant.git
cd llama-cpp-turboquant

# 2026-06-17 更新：llama-cpp-turboquant 的当前最新版（9901 35ac80d）的解码性能问题已经修复。
# 该版本显存占用稍高（参见 https://github.com/TheTom/llama-cpp-turboquant/issues/177），
# 上下文长度可能需要调整。
#
# 2026-06-11 更新：llama-cpp-turboquant 的当前最新版（9450 73eb521）的解码存在一定的性能问题。
# 建议使用 9438 ab11a71 版，该版的预填充性能比性能测试中使用的 9418 2cbfdc6 版略有提升。
# git checkout ab11a71

# 下面三行用于编译 Web UI 前端
cd tools/ui
npm i
npm run build

cd ../..
CUDACXX="/opt/cuda/bin/nvcc" cmake -B build -DGGML_CUDA=ON -DLLAMA_BUILD_WEBUI=ON && cmake --build build --config Release -j --target llama-server llama-cli

编译产物在 build/bin 目录下。

下载模型

modelscope download --model Tariel/Qwen3.6-27B-4bpw-MTP.gguf --local_dir ~/Qwen3.6

部署

cd ~/llama-cpp-turboquant/build/bin
./llama-server -m ~/Qwen3.6/Qwen3.6-27B-4bpw-MTP.gguf \
    -mm ~/Qwen3.6/mmproj-Qwen-Qwen3.6-27B-Q6_K.gguf \
    -ngl all -fa on --spec-type draft-mtp --spec-draft-n-max 3 -np 1 -kvu \
    --ctx-checkpoints 64 --threads-http 2 \
    -ctk q8_0 -ctv turbo3 \
    -c 190208 \
    --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0 \
    --host 0.0.0.0 --port 22345 \
    --api-key "sk-jintianshifengkuangxingqisivivo50" \
    -a qwen3.6-27b \
    --jinja --chat-template-file ~/Qwen3.6/chat_template.jinja \
    --reasoning on --reasoning-format deepseek

注意：请按需修改 --api-key 后面的 API key 的内容。如果不需要限制访问，则去掉 --api-key 选项。

使用

此时用浏览器访问主机的 22345 端口（例如 http://127.0.0.1:22345）即可使用 llama.cpp Web UI。注意如果在其他电脑访问部署了模型的主机，需要打开 CachyOS 自带的防火墙：

sudo ufw allow 22345

llama-server 兼容 OpenAI 和 Anthropic 格式的 API，可以直接在常见的 AI 客户端软件中使用。

聊天模板的开关思考功能

模型附带的 chat_template.jinja 聊天模板（修改自 BoFan-tunning/llama.cpp-MTP-TurboQuant，为了在使用 opencode 等编程工具时能够稳定复用上下文缓存，去掉了自动开关思考功能）可以在输入中插入 <|think_off|> 关闭思考，插入 <|think_on|> 打开思考。

使用 Cherry Studio 时的注意事项

使用 Cherry Studio 时，为正确开关思考模式，建议选择“英伟达”为提供商，API 地址填写 llama-server 的地址（详见 Issue #14981)。

配置 OpenCode

首先安装 opencode。安装 llama-swap 来给模型创建思考和非思考的两个变体。建议同时安装 rtk 以节约 token 使用。

paru -Sy opencode-bin llama-swap-bin rtk
rtk init -g --opencode	# 为 opencode 初始化 rtk

创建 llama-swap-config.yaml：

captureBuffer: 20
performance:
  every: 10s
startPort: 60001
sendLoadingState: true

macros:
  "llama-turboquant": >
    /home/user/llama-cpp-turboquant-bin/llama-server --port ${PORT}

  "qwen_dir": "/home/user/Qwen3.6"

apiKeys:
  - "sk-jintianshifengkuangxingqisivivo50"

models:
  "qwen3.6-27b":
    cmd: |
      ${llama-turboquant}
      -m ${qwen_dir}/Qwen3.6-27B-4bpw-MTP.gguf
      -mm ${qwen_dir}/mmproj-Qwen-Qwen3.6-27B-Q6_K.gguf
      -ngl all -fa on --spec-type draft-mtp --spec-draft-n-max 3 -np 1 -kvu
      --ctx-checkpoints 64 --threads-http 2
      -ctk q8_0 -ctv turbo3
      -c 190208
      --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0
      --jinja --chat-template-file ${qwen_dir}/chat_template.jinja
      --reasoning on --reasoning-format deepseek

    name: "Qwen 3.6 27B"
    ttl: 0
    filters:
      setParamsByID:
        "${MODEL_ID}:thinking":
          chat_template_kwargs:
            enable_thinking: true
        "${MODEL_ID}:instruct":
          chat_template_kwargs:
            enable_thinking: false
    concurrencyLimit: 2
    timeouts:
      connect: 30
      keepalive: 30
      responseHeader: 60
      tlsHandshake: 10
      idleConn: 90

hooks:
  on_startup:
    preload:
      - "qwen3.6-27b"

注意替换 macros: 中 "llama-turboquant" 与 "qwen_dir" 的路径，以及 apiKeys。

修改 ~/.config/opencode/opencode.json：

{
  "$schema": "https://opencode.ai/config.json",
  "model": "llama.cpp/qwen3.6-27b:thinking",
  "small_model": "llama.cpp/qwen3.6-27b:instruct",
  "provider": {
    "llama.cpp": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "llama.cpp (Local)",
      "options": {
        "baseURL": "http://localhost:22345/v1",
        "apiKey": "sk-jintianshifengkuangxingqisivivo50"
      },
      "models": {
        "qwen3.6-27b:thinking": {
          "name": "Qwen3.6 27B (thinking mode)",
          "modalities": {
            "input": ["text", "image"],
            "output": ["text"]
          },
          "limit": {
            "context": 190208,
            "output": 65536
          },
          "options": {
            "max_tokens": 65536
          }
        },
        "qwen3.6-27b:instruct": {
          "name": "Qwen3.6 27B (non-thinking mode)",
          "modalities": {
            "input": ["text", "image"],
            "output": ["text"]
          },
          "limit": {
            "context": 190208,
            "output": 65536
          },
          "options": {
            "max_tokens": 65536
          }
        }
      }
    }
  }
}

注意 "apiKey"的值与 llama-swap-config.yaml 的 apiKeys: 中一致。

使用 llama-swap 进行部署（别忘了打开防火墙上的 22345 端口）：

llama-swap --config llama-swap-config.yaml --listen 0.0.0.0:22345

此时 opencode 可正常使用。创建标题等使用 small_model 的简单任务会调用模型的非思考变体。

性能

性能测试的环境为：CPU i5-12500H (视频输出使用核显), GPU RTX 2080 Ti 22GiB, 内核 Linux 7.0.10-1-cachyos， llama-cpp-turboquant 9418 2cbfdc6 版。

MTP 以少量的显存占用和略微降低预填充速度为代价，显著提高了模型的解码性能和总体性能。

输入上下文 token 数	输出上下文 token 数（无 MTP）	预填充时间/速度（无 MTP）	解码时间/速度（无 MTP）	总时间（无 MTP）	输出上下文 token 数（有 MTP）	预填充时间/速度（有 MTP）	解码时间/速度（有 MTP）	总时间（有 MTP）

30667	1666	52.9 s, 1.72 ms/token, 580.16 token/s	116.6 s, 70.02 ms/token, 14.28 token/s	169.5 s	1556	64.4 s, 2.10 ms/token, 476.14 token/s (speed -17.9%)	53.3 s, 34.22 ms/token, 29.22 token/s (speed +104.6%)	117.7 s
159283	1679	500.8 s, 3.14 ms/token, 318.04 token/s	381.4 s, 227.17 ms/token, 4.40 token/s	882.3 s	1337	579.6 s, 3.64 ms/token, 274.80 token/s (speed -13.6%)	87.4 s, 65.37 ms/token, 15.30 token/s (speed +247.7%)	667.0 s

使用 MTP、思考模式关闭时，不同任务类型的性能如下：

输入上下文长度 (token)	输出长度 (token)	显存占用 (MiB)	预填充速度 (token/s)	解码速度 (token/s)	任务类型
0	/	20270	/	/	/
41	1632	29296	158.78	41.15	解题
75	1260	20298	204.50	39.02	解题
15202	15422	20526	513.91	38.69	翻译
30667	1556	20538	476.14	29.22	长文本总结
53108	2058	20731	419.84	25.17	长文本总结
83743	1612	20954	364.67	20.50	长文本总结
106196	1486	21126	333.72	19.39	长文本总结
136828	1721	21370	296.02	16.11	长文本总结
159283	1337	21542	274.80	15.30	长文本总结
189915	293	21770	249.99	14.50	长文本总结（输出被截断）

思考模式打开时：

输入上下文长度 (token)	输出长度 (token)	显存占用 (MiB)	预填充速度 (token/s)	解码速度 (token/s)	任务类型
0	/	20270	/	/	/
39	7887	20344	131.31	38.94	解题
73	10436	20370	207.27	36.13	解题
15200	18345	20550	518.64	35.71	翻译
30665	3043	20550	478.33	28.10	长文本总结
53106	3150	20726	422.29	24.45	长文本总结
83741	3715	20970	364.77	21.54	长文本总结
106194	3330	21142	333.43	19.65	长文本总结
136826	3860	21386	297.03	16.79	长文本总结
159281	3039	21554	273.92	15.35	长文本总结
189913	294	21770	249.73	15.70	长文本总结（输出被截断）