
Hardware matrix — every machine fak has been profiled on

The point of this page: fak’s correctness and serving claims do not come from one lucky box. The same pure-Go kernel, with the same bit-exact gates, has been run and benchmarked across four distinct hardware platforms spanning two CPU ISAs (arm64 + x86_64), three CPU vendors (Apple · AMD · Intel), four GPU backends (Apple Metal · AMD Vulkan · NVIDIA CUDA on Ada · NVIDIA CUDA on Ampere), and four OS targets (macOS · Windows · WSL2 Linux · Linux). Portability across that spread is itself a result: a kernel that owns the KV cache as its own object has to prove it stays correct on every one of them.

Every number on this page traces to a committed artifact via the single source of truth, fak/BENCHMARK-AUTHORITY.md. This page is the rollup — the at-a-glance “how serious is this” view — not a new claim. Where a number appears here it carries a pointer to the doc + commit that owns it.

Lineage: rolled up 2026-06-21 · fak v0.30.0 · against BENCHMARK-AUTHORITY + MODEL-LADDER-VS-SOTA-2026-06-21 + the per-platform results docs linked below.

[Figure: Hardware coverage matrix. Four platforms across two CPU ISAs, four GPU backends, four operating systems.]


The coverage matrix

| Platform | CPU / ISA | GPU backend | OS | Quant coverage | What’s proven here |
|---|---|---|---|---|---|
| Apple M3 Pro (primary bench node) | Apple M3 Pro 6P+6E, arm64 | 18-core Metal + NEON Q8 | macOS | f32 · Q8_0 · Q4_K · Q2_K | Full model ladder, the agent-fleet value stack, the pure-kernel latency stack, Qwen3.6-27B end-to-end in fak’s own engine |
| AMD Ryzen 9 9950X + Radeon RX 7600 | AMD Zen 5, 16C/32T, x86_64, AVX-512 | Vulkan Q8 (RX 7600) + CPU Q8 | Windows | f32 · Q8_0 · Q4_K | Q8-on-GPU throughput, the GPU/CPU crossover, 3/3 live agent surfaces on Qwen3.6-27B |
| Intel x86_64 + NVIDIA RTX 4070 | Intel, x86_64, AVX2/AVX-512 | CUDA (Ada, sm_89): f32 · F16 · Q8 · graph | Windows + WSL2 Linux | f32 · F16 · Q8_0 | In-kernel CUDA decode at llama.cpp parity, batched decode curve, cross-platform bit-exact determinism vs the Mac |
| an 8-GPU datacenter server (serving lane) | x86_64 host | CUDA (Ampere, sm_80), multi-GPU | Linux | Q4_K · (FP8/BF16 target) | The multi-GPU serving + GLM-5.2 readiness lane: big-iron, where single-box ceilings stop binding |

Reading the spread: the deterministic results (token-count speedups, cache hit rate, bit-exact eviction) are hardware-independent by construction and reproduce byte-for-byte across these boxes. The wall-clock numbers are per-box and stay labeled as such. That the same kernel’s correctness gates pass on Metal, Vulkan, and two CUDA generations is the portability claim this matrix exists to make visible.
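As a cross-check on the headline counts, the matrix rows can be tallied mechanically. The sketch below is illustrative only: the `Platform` and `Coverage` types and the `matrix`, `distinct`, and `tally` helpers are hypothetical names invented for this example, not part of fak. The rows are transcribed from the table above, and the CPU vendor of the datacenter host is left empty because this page does not name it.

```go
package main

import (
	"fmt"
	"sort"
)

// Platform is a hypothetical record mirroring one row of the coverage matrix.
type Platform struct {
	Name      string
	ISA       string
	CPUVendor string   // empty where the page does not name a vendor
	GPU       []string // GPU backends exercised on this box
	OS        []string // OS targets exercised on this box
}

// Coverage holds the distinct values seen on each axis of the matrix.
type Coverage struct{ ISAs, Vendors, GPUs, OSes []string }

// matrix transcribes the four rows of the coverage table.
func matrix() []Platform {
	return []Platform{
		{"Apple M3 Pro", "arm64", "Apple", []string{"Metal"}, []string{"macOS"}},
		{"Ryzen 9 9950X + RX 7600", "x86_64", "AMD", []string{"Vulkan"}, []string{"Windows"}},
		{"Intel + RTX 4070", "x86_64", "Intel", []string{"CUDA Ada (sm_89)"}, []string{"Windows", "WSL2 Linux"}},
		{"8-GPU datacenter server", "x86_64", "", []string{"CUDA Ampere (sm_80)"}, []string{"Linux"}},
	}
}

// distinct returns the sorted set of non-empty values.
func distinct(values []string) []string {
	seen := make(map[string]bool)
	out := []string{}
	for _, v := range values {
		if v == "" || seen[v] {
			continue
		}
		seen[v] = true
		out = append(out, v)
	}
	sort.Strings(out)
	return out
}

// tally collects the distinct ISAs, CPU vendors, GPU backends, and OS targets.
func tally(rows []Platform) Coverage {
	var isas, vendors, gpus, oses []string
	for _, p := range rows {
		isas = append(isas, p.ISA)
		vendors = append(vendors, p.CPUVendor)
		gpus = append(gpus, p.GPU...)
		oses = append(oses, p.OS...)
	}
	return Coverage{distinct(isas), distinct(vendors), distinct(gpus), distinct(oses)}
}

func main() {
	c := tally(matrix())
	fmt.Printf("ISAs=%d vendors=%d GPU backends=%d OS targets=%d\n",
		len(c.ISAs), len(c.Vendors), len(c.GPUs), len(c.OSes))
}
```

Counting the two CUDA generations as separate backends is what yields four GPU backends; collapsing them into a single "CUDA" entry would give three.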


Platform 1 — Apple M3 Pro (node-macos-a) · the primary bench node

The box almost every published fak number is measured on.

| Component | Spec |
|---|---|
| CPU | Apple M3 Pro: 6 performance + 6 efficiency = 12 cores, arm64 |
| GPU | 18-core GPU (Metal 4) |
| Memory | 36 GB unified (CPU+GPU shared), ~150 GB/s |
| OS / toolchain | macOS · Go 1.26.0 |
| Backends | NEON Q8 (arm64 SIMD), Metal GPU (-tags fakmetal), pure-Go HF/GGUF loaders |

What’s been profiled here:


Platform 2 — AMD Ryzen 9 9950X + Radeon RX 7600 · the Vulkan lane

Proves the GPU path is not NVIDIA-only: the same HAL device backend runs on a second vendor’s GPU through Vulkan. This box also serves as the agent / control host for the fleet. Under the run-on-bench-nodes-by-default placement law it is kept for agent use (registered with role: agent-host), so routine hardware-benchmark runs are dispatched to the dedicated bench nodes over Tailscale rather than measured here.

| Component | Spec |
|---|---|
| CPU | AMD Ryzen 9 9950X: 16 cores / 32 threads, x86_64, AVX-512 |
| Memory | 256 GB |
| GPU | AMD Radeon RX 7600 (8 GB, Vulkan 1.4) + integrated UMA |
| OS | Windows 11 (native Vulkan) |
| Backends | Vulkan Q8 device GEMM, CPU Q8 (AVX-512) |
| Role | agent / control host (role: agent-host), kept for agent use, not a default bench target |

What’s been profiled here:


Platform 3 — Intel x86_64 + NVIDIA RTX 4070 · the CUDA / WSL2 lane

The “go all in, fused kernel on a GPU that fits the model” lane, plus the cross-platform determinism check that proves the deterministic metrics are not arm64-specific.

| Component | Spec |
|---|---|
| CPU | Intel, x86_64, AVX2 / AVX-512 |
| GPU | NVIDIA RTX 4070 (Ada, sm_89), CUDA 12.6 |
| OS | Windows 11 (native) + WSL2 Ubuntu (CUDA via user-space micromamba) |
| Backends | CUDA f32 · CUDA F16 (tensor cores, cuBLAS) · CUDA Q8 (W8A16) · CUDA Graph · CPU Q8 |

What’s been profiled here:


Platform 4 — an 8-GPU datacenter server · the multi-GPU serving lane

The big-iron lane: ~320 GB of aggregate GPU memory on a DGX-class node, where the single-box memory ceilings (fak faithful ≤ 7B on the 36 GB Mac) stop binding and the questions become multi-GPU serving and frontier-model readiness.
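The memory ceiling can be made concrete with rough weight-footprint arithmetic. The sketch below assumes that "faithful" means unquantized f32 weights, which this page does not define, and it counts weights only: KV cache, activations, and OS overhead are ignored. The per-weight costs for Q8_0, Q4_K, and Q2_K follow the standard GGUF block layouts; `bytesPerWeight`, `weightGB`, and `fits` are hypothetical names for this example, and the results are arithmetic, not measurements.

```go
package main

import "fmt"

// bytesPerWeight is the storage cost per weight for each format.
// GGUF block layouts: Q8_0 packs 32 weights into 34 bytes,
// Q4_K packs 256 weights into 144 bytes, Q2_K packs 256 into 84 bytes.
var bytesPerWeight = map[string]float64{
	"f32":  4,
	"F16":  2,
	"Q8_0": 34.0 / 32.0,
	"Q4_K": 144.0 / 256.0,
	"Q2_K": 84.0 / 256.0,
}

// weightGB returns the approximate weight footprint in GB (1e9 bytes)
// for a model with the given parameter count in billions.
// The boolean is false when the format is unknown.
func weightGB(paramsBillions float64, format string) (float64, bool) {
	bpw, ok := bytesPerWeight[format]
	if !ok {
		return 0, false
	}
	return paramsBillions * bpw, true
}

// fits reports whether the weights alone fit inside budgetGB.
// Assumption: no headroom is reserved for KV cache or activations.
func fits(paramsBillions float64, format string, budgetGB float64) bool {
	gb, ok := weightGB(paramsBillions, format)
	return ok && gb <= budgetGB
}

func main() {
	cases := []struct {
		params float64
		format string
		budget float64
	}{
		{7, "f32", 36},   // 7B unquantized on the 36 GB Mac
		{27, "f32", 36},  // 27B unquantized on the same box
		{27, "Q8_0", 36}, // 27B quantized to Q8_0
		{27, "f32", 320}, // 27B unquantized against ~320 GB aggregate
	}
	for _, c := range cases {
		gb, _ := weightGB(c.params, c.format)
		fmt.Printf("%.0fB %-5s %.1f GB, fits in %.0f GB: %v\n",
			c.params, c.format, gb, c.budget, fits(c.params, c.format, c.budget))
	}
}
```

Under these assumptions a 7B model at f32 needs about 28 GB and fits in 36 GB, while a 27B model needs about 108 GB at f32 and only fits the Mac once quantized. The ~320 GB figure is an aggregate, so a model that fits in total still has to be sharded across the individual GPUs.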

| Component | Spec |
|---|---|
| GPU | an 8-GPU datacenter server (Ampere, sm_80), ~320 GB aggregate |
| Host | x86_64, Linux |
| Backends | CUDA (sm_80), multi-GPU serving target |

What’s documented here:

Honesty fence. This lane is reported as the documented serving/readiness track, not a published single-box throughput row — the per-rung wall-clock witnesses live behind the same DOS verification discipline as everything else and are gated on the serving work landing. No A100 tok/s figure is asserted here that isn’t traced to an artifact in BENCHMARK-AUTHORITY.md.


Why this many machines

It would be cheaper to benchmark on one box and call it done. fak profiles across this spread on purpose:

  1. Portability is a correctness claim. Because fak owns the KV cache as a kernel object (not rented from a serving engine), its bit-exact eviction and prefix-reuse guarantees have to hold on every backend — Metal, Vulkan, and both CUDA generations. Running the same gates on four platforms is how that claim is kept honest.
  2. Two regimes need two kinds of hardware. The single-stream ceiling (≤7B on 36 GB) is a small-box story; the multi-agent fleet win and the frontier-model serving lane need the AMD/CUDA desktops and the GPU node respectively.
  3. The deterministic metrics must be machine-independent. The token-count speedups and cache hit rates are claimed as hardware-independent — the cross-platform Mac↔Windows reproduction is the witness that they actually are.
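One way to picture the cross-platform witness from point 3 is a digest comparison: each box hashes the token IDs it produced for the same prompt and seed, and the digests must match exactly. This is a minimal sketch under stated assumptions, not fak's actual gate. The `Witness` type and the `digestTokens` and `firstDivergence` helpers are hypothetical, and the token IDs in `main` are made-up sample values.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

// Witness is a hypothetical record of one platform's output digest.
type Witness struct {
	Platform string
	Digest   string
}

// digestTokens hashes token IDs in a fixed little-endian encoding, so the
// digest depends only on the token sequence and not on the host architecture.
func digestTokens(ids []uint32) string {
	h := sha256.New()
	var buf [4]byte
	for _, id := range ids {
		binary.LittleEndian.PutUint32(buf[:], id)
		h.Write(buf[:]) // hash.Hash writes never return an error
	}
	return hex.EncodeToString(h.Sum(nil))
}

// firstDivergence compares every witness against the first one and returns
// the name of the first platform whose digest differs, if any.
func firstDivergence(ws []Witness) (string, bool) {
	if len(ws) == 0 {
		return "", false
	}
	ref := ws[0].Digest
	for _, w := range ws[1:] {
		if w.Digest != ref {
			return w.Platform, true
		}
	}
	return "", false
}

func main() {
	// Made-up token IDs standing in for the output of one deterministic run.
	tokens := []uint32{101, 2023, 2003, 1037, 3231, 102}
	ws := []Witness{
		{"macOS arm64", digestTokens(tokens)},
		{"Windows x86_64", digestTokens(tokens)},
	}
	if name, diverged := firstDivergence(ws); diverged {
		fmt.Println("determinism gate failed on", name)
	} else {
		fmt.Println("all platforms agree:", ws[0].Digest[:16])
	}
}
```

Comparing digests rather than timings is the point of the design: wall-clock numbers legitimately differ per box, while a byte-exact output either matches or it does not.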

See also
