
Qwen3.6-27B on AMD/Vulkan Windows desktop

Witnessed 2026-06-19 on node-desktop-b.

Hardware and backend

Load result

llama-server loaded the model and exposed /v1/models with:

The backend ran as a hybrid CPU/Vulkan setup, but llama.cpp logged a real performance caveat:

sched_reserve: layer 0 is assigned to device CPU but the fused Gated Delta Net tensor is assigned to device Vulkan0 (usually due to missing support)
fused Gated Delta Net (chunked) not supported, set to disabled

This proves the model runs on this node. The disabled fused Gated Delta Net path also explains why absolute throughput is well below the M3 Pro llama.cpp reference in QWEN36-PARITY-RESULTS.md.
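
A check for this caveat can be automated when comparing nodes. The following is a minimal sketch, assuming the llama-server log is available as text and that the two messages are worded exactly as quoted above; `scan_backend_caveats` is a hypothetical helper name, not part of the fak tools.

```python
import re

# Patterns copied from the two log lines quoted above. Other llama.cpp builds
# may word these messages differently, so treat the patterns as assumptions.
DEVICE_MISMATCH = re.compile(
    r"sched_reserve: layer (\d+) is assigned to device (\S+) "
    r"but the fused Gated Delta Net tensor is assigned to device (\S+)"
)
FUSED_DISABLED = "fused Gated Delta Net (chunked) not supported, set to disabled"


def scan_backend_caveats(log_text: str) -> dict:
    """Report whether the fused Gated Delta Net fallback appears in a log."""
    mismatches = [
        {
            "layer": int(m.group(1)),
            "layer_device": m.group(2),
            "tensor_device": m.group(3),
        }
        for m in DEVICE_MISMATCH.finditer(log_text)
    ]
    return {
        "device_mismatches": mismatches,
        "fused_gdn_disabled": FUSED_DISABLED in log_text,
    }
```

A node whose log yields `fused_gdn_disabled: True` should not be compared head-to-head with a node where the fused path is active.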

End-to-end fak surface proof

Full smoke command:

python tools\qwen36_surface_smoke.py `
  --base-url http://127.0.0.1:8131/v1 `
  --model Qwen3.6-27B-Q4_K_M.gguf `
  --node-name amd-vulkan-local-full `
  --gateway-chat `
  --perf-decode-baseline-tps 7.29 `
  --out fak\experiments\qwen36\amd-vulkan-local-full.json `
  --markdown fak\experiments\qwen36\amd-vulkan-local-full.md `
  --model-timeout-s 600 `
  --agent-timeout-s 900 `
  --http-timeout-s 20

Result: 3/3 surfaces passed.

| Surface | Proof |
| --- | --- |
| agent | live fak agent, one turn per arm, real model Qwen3.6-27B-Q4_K_M.gguf, 3 tool calls per arm, report written |
| gateway-openai | /v1/models listed the model and /v1/chat/completions returned OK |
| mcp-http | initialize, tools/list, and tools/call fak_adjudicate all returned JSON-RPC results |
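
The mcp-http proof amounts to three JSON-RPC 2.0 requests in sequence. The sketch below builds those request bodies and checks a response shape. It is illustrative only: the protocol version string, the client name, and the empty `arguments` object for `fak_adjudicate` are assumptions, and the real smoke script may construct its requests differently.

```python
def mcp_smoke_requests(tool_arguments=None):
    """Build the initialize, tools/list, and tools/call request bodies."""
    calls = [
        ("initialize", {
            "protocolVersion": "2025-03-26",  # assumed; use what the server supports
            "capabilities": {},
            "clientInfo": {"name": "surface-smoke-sketch", "version": "0.0.0"},
        }),
        ("tools/list", None),
        # The argument schema of fak_adjudicate is not shown in this report,
        # so an empty object stands in as a placeholder.
        ("tools/call", {"name": "fak_adjudicate", "arguments": tool_arguments or {}}),
    ]
    requests = []
    for request_id, (method, params) in enumerate(calls, start=1):
        message = {"jsonrpc": "2.0", "id": request_id, "method": method}
        if params is not None:
            message["params"] = params
        requests.append(message)
    return requests


def is_rpc_result(response: dict) -> bool:
    """True when a JSON-RPC 2.0 response carries a result and no error."""
    return (
        response.get("jsonrpc") == "2.0"
        and "result" in response
        and "error" not in response
    )
```

Under this reading, the surface passes when `is_rpc_result` holds for all three responses.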

Artifacts:

Standalone packet path

For another AMD/Vulkan Windows test bench, generate and send the explicit Vulkan packet instead of using the NVIDIA wrapper. The broader multi-GPU serving plan is the A100 model ladder.

python tools\qwen36_node_packet.py --profile vulkan --report-target auto

For a watched Tailscale node, force the packet profile when the node registry does not carry GPU facts:

python tools\qwen36_watch_nodes.py `
  --node <tailnet-node> `
  --send-packet `
  --packet-profile vulkan `
  --gateway-chat `
  --perf-decode-baseline-tps 7.29

The generated node wrapper runs qwen36_node_server.py --profile vulkan, which keeps the AMD/llama.cpp shape used above: partial Vulkan offload, --fit on, tailnet-only bind, preflight-first start, and returned qwen36-reports/ logs.

Performance

Gateway chat report:

llama.cpp server timing for that same gateway chat:

prompt eval time = 1673.25 ms / 15 tokens (8.96 tokens per second)
eval time        = 44589.79 ms / 125 tokens (2.80 tokens per second)
total time       = 46263.04 ms / 140 tokens
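
The rates in these log lines can be recomputed from the raw milliseconds and token counts, which is useful when comparing runs. This is a minimal sketch, assuming the timing lines keep the format shown above. `relative_to_baseline` is a hypothetical helper, and reading 7.29 as a decode baseline in tokens per second is inferred from the `--perf-decode-baseline-tps` flag name rather than stated in this report.

```python
import re

# Matches the "prompt eval time" and "eval time" lines shown above.
# The "total time" line is intentionally not matched.
TIMING = re.compile(
    r"^\s*(?P<phase>prompt eval|eval) time\s*=\s*"
    r"(?P<ms>[\d.]+) ms\s*/\s*(?P<tokens>\d+) tokens",
    re.MULTILINE,
)


def tokens_per_second(log_text: str) -> dict:
    """Return tokens/s per phase, recomputed from ms and token counts."""
    rates = {}
    for m in TIMING.finditer(log_text):
        seconds = float(m.group("ms")) / 1000.0
        rates[m.group("phase")] = int(m.group("tokens")) / seconds
    return rates


def relative_to_baseline(decode_tps: float, baseline_tps: float) -> float:
    """Fraction of the baseline decode rate that this run achieved."""
    return decode_tps / baseline_tps


log = """\
prompt eval time = 1673.25 ms / 15 tokens (8.96 tokens per second)
eval time        = 44589.79 ms / 125 tokens (2.80 tokens per second)
total time       = 46263.04 ms / 140 tokens
"""
rates = tokens_per_second(log)
print({phase: round(tps, 2) for phase, tps in rates.items()})
print(round(relative_to_baseline(rates["eval"], 7.29), 2))
```

The recomputed values agree with the logged 8.96 and 2.80 tokens per second. If 7.29 is indeed the decode baseline, the decode rate here is roughly 0.38 of it.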

Interpretation:

Pure-fak in-kernel speed parity sweep

After the Q8 hybrid fresh-prefill and head-parallel Gated-DeltaNet scan work, the pure-fak in-kernel runtime now has a direct local microbench against this same WinGet llama.cpp Vulkan build and the same GGUF.

Commands:

go -C fak run ./cmd/modelbench `
  -lean `
  -gguf C:\Users\USER\.cache\fak-models\gguf\Qwen3.6-27B-Q4_K_M.gguf `
  -prefill-sizes 16,64,256 `
  -prefill-reps 1 `
  -decode-prompt 16 `
  -decode-steps 4 `
  -decode-reps 1 `
  -out experiments\qwen36\native-gguf-q8-hybrid-headscan-p16-64-256-20260619.json

& C:\Users\USER\AppData\Local\Microsoft\WinGet\Packages\ggml.llamacpp_Microsoft.Winget.Source_8wekyb3d8bbwe\llama-bench.exe `
  -m C:\Users\USER\.cache\fak-models\gguf\Qwen3.6-27B-Q4_K_M.gguf `
  -p 16,64,256 `
  -n 1 `
  -r 1 `
  -o json > fak\experiments\qwen36\llamacpp-vulkan-qwen36-pp16-64-256-tg1-20260619.json

| Workload | pure-fak Q8 in-kernel | llama.cpp Vulkan b9673 | Ratio (fak / llama.cpp) |
| --- | --- | --- | --- |
| Prefill P16 | 14.86 tok/s | 5.20 tok/s | 2.86x |
| Prefill P64 | 27.46 tok/s | 14.57 tok/s | 1.88x |
| Prefill P256 | 31.34 tok/s | 9.95 tok/s | 3.15x |
| Decode TG1 | 1.24 tok/s | 0.99 tok/s | 1.25x |
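
The Ratio column is the pure-fak rate divided by the llama.cpp rate for the same workload. The short sketch below recomputes it from the throughput figures in the table above; the row tuples are transcribed by hand here, not read from the JSON artifacts.

```python
# (workload, pure-fak tok/s, llama.cpp Vulkan tok/s), transcribed from the table.
ROWS = [
    ("Prefill P16", 14.86, 5.20),
    ("Prefill P64", 27.46, 14.57),
    ("Prefill P256", 31.34, 9.95),
    ("Decode TG1", 1.24, 0.99),
]


def speedup(fak_tps: float, llamacpp_tps: float) -> float:
    """Throughput ratio; values above 1.0 mean pure-fak is faster."""
    return fak_tps / llamacpp_tps


ratios = {name: f"{speedup(fak, ref):.2f}x" for name, fak, ref in ROWS}
for name, ratio in ratios.items():
    print(f"{name}: {ratio}")
```

The printed ratios match the table: 2.86x, 1.88x, 3.15x, and 1.25x.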

Artifacts:

Extended sweep:

| Workload | pure-fak Q8 in-kernel | llama.cpp Vulkan b9673 | Ratio (fak / llama.cpp) |
| --- | --- | --- | --- |
| Prefill P512 | 30.28 tok/s | 9.32 tok/s | 3.25x |
| Prefill P1024 | 29.67 tok/s | 9.31 tok/s | 3.19x |
| Decode D16 | 1.15 tok/s | 0.99 tok/s | 1.16x |

Extended artifacts:

Interpretation: