Community · numaya.ai

We benchmarked Qwen3.6-35B-A3B on llama.cpp TurboQuant across seven configurations and three context depths, then ran a full 1M-token needle-in-haystack test and an eight-test agentic eval battery, all on a single workstation with two RTX 3090s (48 GB VRAM total). Config D, which pairs Q6_K weights with a q8_0 K cache and a turbo3 V cache, is our pick for quality and stability at 1M context.

The challenge

True million-token context is normally the preserve of datacentre GPUs. We wanted to know whether a quantised open model could hold a real 1M-token window — not a theoretical limit, but a prompt that actually fills it — on hardware that fits under a desk, while staying coherent enough to drive agentic tool-calling.

Config D

Model: Qwen3.6-35B-A3B Q6_K (26.6 GB, near-lossless)
Engine: llama-cpp-turboquant v0.1.1
KV cache: K = q8_0 / V = turbo3
Context: 1,048,576 tokens (1M), ~44 GB / 48 GB VRAM
Agentic eval: 7/8 (the one failure is a llama.cpp engine limit, not the model)
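As a rough illustration of how the settings above might map onto a launch command, here is a minimal sketch that builds an argument list. The flag names are the ones upstream llama.cpp's llama-server uses. The binary name, the model file name, the GPU-offload value, and the assumption that the TurboQuant build accepts `turbo3` as a V cache type are all illustrative guesses, not taken from the benchmark scripts.

```python
import shlex

# Hypothetical file name; the post does not give the GGUF path.
MODEL_PATH = "Qwen3.6-35B-A3B-Q6_K.gguf"


def config_d_args(model_path: str = MODEL_PATH) -> list[str]:
    """Build a llama-server style argument list mirroring Config D."""
    return [
        "llama-server",                # assumed binary name (upstream llama.cpp)
        "--model", model_path,
        "--ctx-size", str(1_048_576),  # 1M-token window from the post
        "--n-gpu-layers", "99",        # assumption: offload all layers to the GPUs
        "--cache-type-k", "q8_0",      # K cache type from the post
        "--cache-type-v", "turbo3",    # V cache type; assumed to be accepted by the fork
    ]


if __name__ == "__main__":
    # Print a shell-quoted command line for inspection; nothing is executed.
    print(shlex.join(config_d_args()))
```

Check the flags against the build you actually run before relying on them; the harness in the repository is the authoritative source for the exact invocation.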

1M context — PASS

A needle-in-haystack run pushed 1,038,653 prompt tokens (99% of the 1M limit). The needle was recovered as an exact match in ~32 minutes with no crash or OOM. Config D is stable at true 1M context.
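The figures in this run can be sanity-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the throughput it derives is an end-to-end average that assumes the ~32 minutes is dominated by prompt processing, which the post does not state explicitly.

```python
PROMPT_TOKENS = 1_038_653   # prompt size reported for the needle run
CONTEXT_LIMIT = 1_048_576   # 1M-token window (2**20)
ELAPSED_MINUTES = 32        # approximate wall-clock time from the post


def fill_fraction(prompt_tokens: int, context_limit: int) -> float:
    """Fraction of the context window occupied by the prompt."""
    return prompt_tokens / context_limit


def avg_tokens_per_second(prompt_tokens: int, minutes: float) -> float:
    """Average tokens per second over the whole run (not a pure prefill rate)."""
    return prompt_tokens / (minutes * 60)


if __name__ == "__main__":
    print(f"fill: {fill_fraction(PROMPT_TOKENS, CONTEXT_LIMIT):.2%}")
    print(f"headroom: {CONTEXT_LIMIT - PROMPT_TOKENS} tokens")
    print(f"average: {avg_tokens_per_second(PROMPT_TOKENS, ELAPSED_MINUTES):.0f} tokens/s")
```

Under that assumption the run averages roughly 540 tokens per second, with just under 10,000 tokens of headroom left in the window.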

Why only 10 KV layers

Qwen3.6-35B-A3B is a hybrid: of its 40 layers, only 10 are full-attention — the other 30 are Gated DeltaNet (SSM-like) with fixed recurrent state and no KV cache. That makes KV VRAM roughly 4× lower than a naive 40-layer estimate, which is what makes 1M context fit in 48 GB at all.
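The 4× saving follows directly from the layer count, as this back-of-the-envelope estimator shows. The KV head count and head dimension below are placeholder values, not figures from the model card, and the bits-per-value for turbo3 is an assumption; only the 10-versus-40 layer split and the 1M context come from the post. q8_0 stores 32 values in 34 bytes, hence 8.5 bits per value.

```python
def kv_cache_bytes(n_tokens: int, n_kv_layers: int, n_kv_heads: int,
                   head_dim: int, k_bits: float, v_bits: float) -> float:
    """Estimate KV cache size: one K and one V vector per token per attention layer."""
    values_per_token_per_layer = n_kv_heads * head_dim
    total_bits = n_tokens * n_kv_layers * values_per_token_per_layer * (k_bits + v_bits)
    return total_bits / 8


# Placeholder architecture values (assumptions for illustration only).
N_KV_HEADS = 2
HEAD_DIM = 256
K_BITS = 8.5   # q8_0: 34 bytes per block of 32 values
V_BITS = 3.0   # assumed effective width for turbo3

CONTEXT = 1_048_576

hybrid = kv_cache_bytes(CONTEXT, 10, N_KV_HEADS, HEAD_DIM, K_BITS, V_BITS)
naive = kv_cache_bytes(CONTEXT, 40, N_KV_HEADS, HEAD_DIM, K_BITS, V_BITS)

if __name__ == "__main__":
    print(f"10 attention layers: {hybrid / 2**30:.2f} GiB")
    print(f"40 attention layers: {naive / 2**30:.2f} GiB")
    print(f"ratio: {naive / hybrid:.1f}x")
```

The absolute sizes depend entirely on the placeholder dimensions, but the ratio does not: whatever the per-layer cost, caching 10 layers instead of 40 cuts it by a factor of four.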

Key finding

TurboQuant KV is ~38–46% slower than q4_0 KV: "turbo" means high compression, not higher speed. Use q4_0/q4_0 for maximum tokens/sec; use q8_0/turbo3 (Config D) for maximum quality and stability at 1M.
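To put the slowdown in concrete terms, the helper below converts a q4_0 throughput into the range implied by the 38–46% figure. It assumes "slower" means lower tokens/sec, and the 100 tokens/sec baseline is a made-up number for illustration, not a measurement from the benchmark matrix.

```python
def turbo_tps_range(q4_tps: float, slowdown_low: float = 0.38,
                    slowdown_high: float = 0.46) -> tuple[float, float]:
    """Return (worst, best) TurboQuant-KV tokens/sec implied by a q4_0 baseline."""
    worst = q4_tps * (1 - slowdown_high)
    best = q4_tps * (1 - slowdown_low)
    return worst, best


if __name__ == "__main__":
    baseline = 100.0  # hypothetical q4_0/q4_0 throughput in tokens/sec
    worst, best = turbo_tps_range(baseline)
    print(f"TurboQuant KV: about {worst:.0f} to {best:.0f} tokens/sec")
```

With that hypothetical baseline, TurboQuant KV would land at about 54 to 62 tokens/sec.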

The full benchmark matrix, scripts, and agentic eval harness are open-source on GitHub.
