Skip to the content.

fak’s own engine runs Qwen3.6-27B (qwen35) — witnessed on M3

The headline for the goal lane: fak’s own in-kernel forward pass — not llama.cpp — now runs the Qwen3.5/Qwen3.6 hybrid Gated-DeltaNet architecture end-to-end in chat on Apple M3 Pro. The 0.8B safetensors run is the coherent f32 architecture witness; the 27B q4_k_m GGUF now also loads and generates through fak’s pure in-kernel path via the GGUF->Q8 cached GDN runtime.

Witness (reproducible)

$ /tmp/fakchat -hf ~/.cache/fak-models/qwen3.5-0.8b \
    -p "What is the capital of France? Answer in one short sentence." -n 40
model=qwen3_5  load=1203ms  prompt_tokens=32  backend=fak in-kernel Gated-DeltaNet (f32, cacheless)
<think>

</think>

Paris is the capital of France.

Correct, coherent, and it even emits the Qwen <think> block. The full path is fak’s own: internal/tokenizer (Encode/Decode, oracle-validated) → ChatML template → model.Forward running the Gated-DeltaNet linear-attention scan + gated full-attention (internal/model/qwen35.go, ported from transformers Qwen3_5GatedDeltaNet) → sampling → detokenize → stream.

2026-06-19 cached-decode refresh

The original M3 chat witness above was the cacheless path. Current fak now routes the Qwen3.5/Qwen3.6 f32 safetensors path through Session.Prefill / Session.Step: full-attention KV and the Gated-DeltaNet recurrent conv/state both live in the session. That makes cmd/fakchat and cmd/qwen35check use cached decode instead of rerunning whole-sequence Forward for each generated token. Current unit witnesses are TestQwen35HybridSessionMatchesForwardAndPersistsState and TestQwen35HybridQuantTokenLoopPersistsState.

2026-06-19 Qwen3.6-27B pure-fak GGUF witness

This is the real 27B artifact on this M3 Pro, with no llama-server, no external OpenAI-compatible proxy, and no llama.cpp in the execution path:

/usr/bin/time -l go -C fak run ./cmd/fakchat \
  -gguf /Users/USER/.cache/fak-models/gguf/Qwen3.6-27B.q4_k_m.gguf \
  -tok /Users/USER/.cache/fak-models/tokenizers/qwen3.6 \
  -p "Say OK." \
  -n 1

Observed output:

model=qwen35  load=75505ms  prompt_tokens=22  backend=fak in-kernel Gated-DeltaNet (GGUF->Q8, cached)
<think>
---
prefill: 22 tok in 40.62s (0.5 tok/s)  |  cached qwen3_5 decode: 1 tok in 16.25s (0.1 tok/s)
      135.67 real       339.68 user        97.23 sys
         25785204736  maximum resident set size

What this proves:

What this does not claim yet:

How it works today

The 27B-size status on a 36 GB box (precise)

The f32 path is still too large for 36 GB, but the GGUF->Q8 path now runs:

Path Footprint Fits 36 GB? Missing
f32 (LoadSafetensorsDir, the validated GDN path) 27B×4 ≈ 108 GB
GGUF->Q8 cached runtime observed 25.8 GB RSS speed/broader logit-parity work
native q4 GGUF runtime ≈ 16 GB ✓ in principle direct q4 kernels / no Q8 expansion

The size gate is closed for an end-to-end command-line smoke. The remaining gap is performance/correctness evidence at the real-artifact level: load-time reduction (#95), direct q4 residency (#96), GDN/full-attention phase profiling and acceleration (#97), device prefill for the GDN/full-attention projections (#92), and a short llama.cpp/HF oracle to prove logits rather than just execution (#93).

Status summary

For DGX and standalone endpoint-backed test benches, use a multi-GPU A100 serving host.

Witnessed 2026-06-18 and refreshed 2026-06-19 on Apple M3 Pro (36 GB). fak rows are fak’s own forward pass.