Qwen3.6-27B on AMD/Vulkan Windows desktop
Witnessed 2026-06-19 on node-desktop-b.
Hardware and backend
- CPU: AMD Ryzen 9 9950X, 16 cores / 32 threads, AVX512-capable.
- RAM: 272 GB physical.
- GPU devices seen by llama.cpp b9673 WinGet Vulkan build:
  - Vulkan0: AMD Radeon RX 7600, 8176 MiB.
  - Vulkan1: AMD Radeon(TM) Graphics, integrated UMA.
- Model: Qwen3.6-27B-Q4_K_M.gguf, 16,547,398,784 bytes, from lmstudio-community/Qwen3.6-27B-GGUF.
- Server command shape:
llama-server -m Qwen3.6-27B-Q4_K_M.gguf --host 127.0.0.1 --port 8131 --ctx-size 8192 --n-gpu-layers 20 --fit on
Load result
llama-server loaded the model and exposed /v1/models with:
- n_vocab: 248320
- n_ctx: 8192
- n_ctx_train: 262144
- n_embd: 5120
- n_params: 26895998464
- size: 16536406016
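The reported fields can be sanity-checked from a script. The sketch below is illustrative only: it assumes the listing is an OpenAI-style object with a `data` array and that the fields sit under a `meta` key, which may differ between llama.cpp builds. The `model_meta` helper and the sample payload are hypothetical; the values are the ones listed above.

```python
import json

# Sample payload shaped like an OpenAI-style model listing. The "meta" key and
# its layout are assumptions; the numbers are copied from the load result above.
SAMPLE = json.dumps({
    "object": "list",
    "data": [{
        "id": "Qwen3.6-27B-Q4_K_M.gguf",
        "meta": {
            "n_vocab": 248320,
            "n_ctx_train": 262144,
            "n_embd": 5120,
            "n_params": 26895998464,
            "size": 16536406016,
        },
    }],
})

def model_meta(payload: str, model_id: str) -> dict:
    """Return the metadata dict for model_id, or raise KeyError if it is absent."""
    for entry in json.loads(payload).get("data", []):
        if entry.get("id") == model_id:
            return entry.get("meta", {})
    raise KeyError(model_id)

meta = model_meta(SAMPLE, "Qwen3.6-27B-Q4_K_M.gguf")
# Average storage cost per parameter, a rough check that the file is a ~4-5 bit quant.
bits_per_weight = meta["size"] * 8 / meta["n_params"]
print(f"{meta['n_params'] / 1e9:.2f} B params, {bits_per_weight:.2f} bits/weight")
```

To run this against a live server, replace `SAMPLE` with the body returned by the /v1/models endpoint.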
The backend ran as a hybrid CPU/Vulkan setup, but llama.cpp logged a real perf caveat:
sched_reserve: layer 0 is assigned to device CPU but the fused Gated Delta Net tensor is assigned to device Vulkan0 (usually due to missing support)
fused Gated Delta Net (chunked) not supported, set to disabled
This shows that the model runs on this setup. The disabled fused Gated Delta Net path is also a likely contributor to absolute throughput being well below
the M3 Pro llama.cpp reference in QWEN36-PARITY-RESULTS.md.
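When comparing nodes it can help to flag this fallback automatically. A minimal sketch, assuming the server log has been captured as text and that the wording matches the two lines quoted above (other builds may phrase them differently); `fallback_lines` is a hypothetical helper, not part of llama.cpp or fak.

```python
import re

# Patterns derived from the two log lines quoted above. Treat the exact wording
# as build-specific: a different llama.cpp version may not emit these strings.
FALLBACK_PATTERNS = (
    re.compile(r"fused Gated Delta Net .*not supported"),
    re.compile(r"assigned to device CPU but the fused Gated Delta Net tensor"),
)

def fallback_lines(log_text: str) -> list:
    """Return the log lines that indicate the fused Gated Delta Net fallback."""
    return [
        line for line in log_text.splitlines()
        if any(pattern.search(line) for pattern in FALLBACK_PATTERNS)
    ]

SAMPLE_LOG = "\n".join([
    "unrelated log line",  # placeholder, not real llama.cpp output
    "sched_reserve: layer 0 is assigned to device CPU but the fused Gated Delta Net "
    "tensor is assigned to device Vulkan0 (usually due to missing support)",
    "fused Gated Delta Net (chunked) not supported, set to disabled",
])

hits = fallback_lines(SAMPLE_LOG)
print(f"{len(hits)} fallback line(s) found")
```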
End-to-end fak surface proof
Full smoke command:
python tools\qwen36_surface_smoke.py `
--base-url http://127.0.0.1:8131/v1 `
--model Qwen3.6-27B-Q4_K_M.gguf `
--node-name amd-vulkan-local-full `
--gateway-chat `
--perf-decode-baseline-tps 7.29 `
--out fak\experiments\qwen36\amd-vulkan-local-full.json `
--markdown fak\experiments\qwen36\amd-vulkan-local-full.md `
--model-timeout-s 600 `
--agent-timeout-s 900 `
--http-timeout-s 20
Result: 3/3 surfaces passed.
| surface | proof |
|---|---|
| agent | live fak agent, one turn per arm, real model Qwen3.6-27B-Q4_K_M.gguf, 3 tool calls per arm, report written |
| gateway-openai | /v1/models listed the model and /v1/chat/completions returned OK |
| mcp-http | initialize, tools/list, and tools/call fak_adjudicate all returned JSON-RPC results |
Artifacts:
- fak/experiments/qwen36/amd-vulkan-local-full.json
- fak/experiments/qwen36/amd-vulkan-local-full.md
- fak/experiments/qwen36/qwen36-agent-amd-vulkan-local-full.json
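For readers unfamiliar with the mcp-http surface, the three calls in the table are ordinary JSON-RPC 2.0 requests. The sketch below only builds the request bodies and does not contact a server. The protocol version string, the client info, and the empty `arguments` object are assumptions for illustration; the real argument schema for fak_adjudicate is not stated here. `rpc_request` and `is_result` are hypothetical helpers, not part of the smoke script.

```python
import json
from itertools import count
from typing import Optional

_ids = count(1)

def rpc_request(method: str, params: Optional[dict] = None) -> dict:
    """Build a JSON-RPC 2.0 request object with a fresh integer id."""
    request = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        request["params"] = params
    return request

def is_result(response: dict) -> bool:
    """True when a JSON-RPC response carries a result rather than an error."""
    return "result" in response and "error" not in response

calls = [
    rpc_request("initialize", {
        "protocolVersion": "2025-03-26",  # assumed; use the version the server accepts
        "capabilities": {},
        "clientInfo": {"name": "example-client", "version": "0.0.0"},  # hypothetical
    }),
    rpc_request("tools/list"),
    # Argument schema for fak_adjudicate is unknown here, so it is left empty.
    rpc_request("tools/call", {"name": "fak_adjudicate", "arguments": {}}),
]

for call in calls:
    print(json.dumps(call))
```

A surface check in this style would pass only when `is_result` holds for the response to each of the three requests.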
Standalone packet path
For another AMD/Vulkan Windows test bench, generate and send the explicit Vulkan packet instead of using the NVIDIA wrapper. The broader multi-GPU serving plan is the A100 model ladder.
python tools\qwen36_node_packet.py --profile vulkan --report-target auto
For a watched Tailscale node, force the packet profile when the node registry does not carry GPU facts:
python tools\qwen36_watch_nodes.py `
--node <tailnet-node> `
--send-packet `
--packet-profile vulkan `
--gateway-chat `
--perf-decode-baseline-tps 7.29
The generated node wrapper runs qwen36_node_server.py --profile vulkan, which keeps
the AMD/llama.cpp shape used above: partial Vulkan offload, --fit on, tailnet-only
bind, preflight-first start, and returned qwen36-reports/ logs.
Performance
Gateway chat report:
- Prompt tokens: 15
- Completion tokens: 125
- Wall time: 46.351 s
- Gateway-estimated decode: 2.697 tok/s
- Ratio to the M3 Pro llama.cpp reference decode bar (7.29 tok/s): 0.37x
llama.cpp server timing for that same gateway chat:
prompt eval time = 1673.25 ms / 15 tokens (8.96 tokens per second)
eval time = 44589.79 ms / 125 tokens (2.80 tokens per second)
total time = 46263.04 ms / 140 tokens
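The timing lines above can be parsed and cross-checked mechanically. A minimal sketch, assuming the three-line layout shown here; `parse_timings` is a hypothetical helper and the regular expression is fitted to these lines only, so other llama.cpp builds may need a different pattern.

```python
import re

# Matches "<label> = <ms> ms / <n> tokens". The longer label is listed first so
# "prompt eval time" is not mistaken for "eval time".
TIMING_RE = re.compile(
    r"^\s*(prompt eval time|eval time|total time)\s*=\s*([\d.]+)\s*ms\s*/\s*(\d+)\s*tokens"
)

LOG = """\
prompt eval time = 1673.25 ms / 15 tokens (8.96 tokens per second)
eval time = 44589.79 ms / 125 tokens (2.80 tokens per second)
total time = 46263.04 ms / 140 tokens
"""

def parse_timings(text: str) -> dict:
    """Map each timing label to its milliseconds, token count, and tokens/second."""
    timings = {}
    for line in text.splitlines():
        match = TIMING_RE.match(line)
        if match:
            ms, tokens = float(match.group(2)), int(match.group(3))
            timings[match.group(1)] = {"ms": ms, "tokens": tokens, "tps": tokens / (ms / 1000.0)}
    return timings

timings = parse_timings(LOG)
# Consistency check: prompt and decode phases should add up to the total.
phase_ms = timings["prompt eval time"]["ms"] + timings["eval time"]["ms"]
print(f"decode {timings['eval time']['tps']:.2f} tok/s, "
      f"phase sum {phase_ms:.2f} ms vs total {timings['total time']['ms']:.2f} ms")
```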
Interpretation:
- End-to-end chat works on this AMD CPU/GPU setup.
- Throughput through the fak gateway is near parity with llama.cpp on the same setup: 2.697 tok/s wall-clock through fak vs 2.80 tok/s from llama.cpp timing, about 0.96x.
- Absolute local decode is not at the M3 Pro reference bar: 2.80 tok/s vs 7.29 tok/s, about 0.38x, due at least in part to llama.cpp’s current Vulkan Gated-DeltaNet fallback.
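The ratios in this section follow directly from the reported numbers. The sketch below reproduces the arithmetic; it is a worked check rather than the gateway's actual code. Note that the gateway figure matches completion tokens divided by the full wall time, which suggests (as an inference from the numbers, not a documented fact) that it includes prompt processing, while the llama.cpp figure covers the decode phase only.

```python
COMPLETION_TOKENS = 125
WALL_TIME_S = 46.351       # gateway wall time for the whole request
LLAMA_EVAL_MS = 44589.79   # llama.cpp decode-phase time
REFERENCE_TPS = 7.29       # M3 Pro llama.cpp reference decode bar

gateway_tps = COMPLETION_TOKENS / WALL_TIME_S
llama_tps = COMPLETION_TOKENS / (LLAMA_EVAL_MS / 1000.0)

gateway_vs_llama = gateway_tps / llama_tps
gateway_vs_reference = gateway_tps / REFERENCE_TPS
llama_vs_reference = llama_tps / REFERENCE_TPS

print(f"gateway decode:       {gateway_tps:.3f} tok/s")
print(f"llama.cpp decode:     {llama_tps:.2f} tok/s")
print(f"gateway / llama.cpp:  {gateway_vs_llama:.2f}x")
print(f"gateway / reference:  {gateway_vs_reference:.2f}x")
print(f"llama.cpp / reference: {llama_vs_reference:.2f}x")
```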
Pure-fak in-kernel speed parity sweep
After the Q8 hybrid fresh-prefill and head-parallel Gated-DeltaNet scan work, the pure-fak in-kernel runtime now has a direct local microbench against this same WinGet llama.cpp Vulkan build and the same GGUF.
Commands:
go -C fak run ./cmd/modelbench `
-lean `
-gguf C:\Users\USER\.cache\fak-models\gguf\Qwen3.6-27B-Q4_K_M.gguf `
-prefill-sizes 16,64,256 `
-prefill-reps 1 `
-decode-prompt 16 `
-decode-steps 4 `
-decode-reps 1 `
-out experiments\qwen36\native-gguf-q8-hybrid-headscan-p16-64-256-20260619.json
& C:\Users\USER\AppData\Local\Microsoft\WinGet\Packages\ggml.llamacpp_Microsoft.Winget.Source_8wekyb3d8bbwe\llama-bench.exe `
-m C:\Users\USER\.cache\fak-models\gguf\Qwen3.6-27B-Q4_K_M.gguf `
-p 16,64,256 `
-n 1 `
-r 1 `
-o json > fak\experiments\qwen36\llamacpp-vulkan-qwen36-pp16-64-256-tg1-20260619.json
| Workload | pure-fak Q8 in-kernel | llama.cpp Vulkan b9673 | Ratio |
|---|---|---|---|
| Prefill P16 | 14.86 tok/s | 5.20 tok/s | 2.86x |
| Prefill P64 | 27.46 tok/s | 14.57 tok/s | 1.88x |
| Prefill P256 | 31.34 tok/s | 9.95 tok/s | 3.15x |
| Decode TG1 | 1.24 tok/s | 0.99 tok/s | 1.25x |
Artifacts:
- fak/experiments/qwen36/native-gguf-q8-hybrid-headscan-p16-64-256-20260619.json
- fak/experiments/qwen36/native-gguf-q8-hybrid-batch-p16-64-256-20260619.json
- fak/experiments/qwen36/llamacpp-vulkan-qwen36-pp16-64-256-tg1-20260619.json
Extended sweep:
| Workload | pure-fak Q8 in-kernel | llama.cpp Vulkan b9673 | Ratio |
|---|---|---|---|
| Prefill P512 | 30.28 tok/s | 9.32 tok/s | 3.25x |
| Prefill P1024 | 29.67 tok/s | 9.31 tok/s | 3.19x |
| Decode D16 | 1.15 tok/s | 0.99 tok/s | 1.16x |
Extended artifacts:
- fak/experiments/qwen36/native-gguf-q8-hybrid-headscan-p512-1024-dp256-d16-20260619.json
- fak/experiments/qwen36/llamacpp-vulkan-qwen36-pp512-1024-tg16-20260619.json
- fak/experiments/qwen36/qwen36-perf-gate-amd-20260619.json
- fak/experiments/qwen36/qwen36-perf-gate-amd-20260619.md
Interpretation:
- Single-stream microbench speed parity is met on this AMD CPU/GPU setup for the measured P16/P64/P256/P512/P1024 prefill points and the measured TG1/D16 decode points.
- The parity statement is machine-checked by python tools/qwen36_perf_gate.py; the current gate report is PASS at a 1.0x minimum fak/llama.cpp ratio.
- This is still not a broad quality/logit-parity claim. The real-artifact logit oracle and longer prompt/decode sweeps remain separate gates.
- The llama.cpp baseline is the WinGet b9673 Vulkan build with n_gpu_layers=-1; the live server path above may use --fit on --n-gpu-layers 20, so compare artifact-to-artifact.
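The gate logic described above amounts to checking that every fak/llama.cpp ratio meets a minimum. The sketch below illustrates that idea using the numbers from the two tables; it is not the contents of tools/qwen36_perf_gate.py, whose inputs and report format are not shown here, and `evaluate` is a hypothetical name.

```python
MIN_RATIO = 1.0  # minimum fak/llama.cpp ratio stated for the gate

# (workload, pure-fak tok/s, llama.cpp Vulkan tok/s), copied from the tables above.
ROWS = [
    ("Prefill P16", 14.86, 5.20),
    ("Prefill P64", 27.46, 14.57),
    ("Prefill P256", 31.34, 9.95),
    ("Decode TG1", 1.24, 0.99),
    ("Prefill P512", 30.28, 9.32),
    ("Prefill P1024", 29.67, 9.31),
    ("Decode D16", 1.15, 0.99),
]

def evaluate(rows, min_ratio):
    """Return per-workload ratios and the names of workloads below min_ratio."""
    results = [(name, fak_tps / base_tps) for name, fak_tps, base_tps in rows]
    failures = [name for name, ratio in results if ratio < min_ratio]
    return results, failures

results, failures = evaluate(ROWS, MIN_RATIO)
for name, ratio in results:
    print(f"{name:14s} {ratio:.2f}x")

# The tightest margin is the most useful number to watch between runs.
worst_name, worst_ratio = min(results, key=lambda item: item[1])
print(f"worst: {worst_name} at {worst_ratio:.2f}x")
print("PASS" if not failures else f"FAIL: {failures}")
```

Tracking the worst ratio alongside the verdict makes it easier to see how close the decode points sit to the 1.0x threshold.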