We benchmarked Qwen3.6-35B-A3B on llama.cpp TurboQuant across seven configurations and three context depths, then ran a full 1M-token needle-in-haystack test and an eight-test agentic eval battery — all on a single workstation with two RTX 3090s (48 GB VRAM total). Config D wins.
The challenge
True million-token context is normally the preserve of datacentre GPUs. We wanted to know whether a quantised open model could hold a real 1M-token window — not a theoretical limit, but a prompt that actually fills it — on hardware that fits under a desk, while staying coherent enough to drive agentic tool-calling.
Config D
- Model: Qwen3.6-35B-A3B Q6_K (26.6 GB, near-lossless)
- Engine: llama-cpp-turboquant v0.1.1
- KV cache: K = q8_0 / V = turbo3
- Context: 1,048,576 tokens (1M), ~44 GB / 48 GB VRAM
- Agentic eval: 7/8 (the one failure is a llama.cpp engine limit, not the model)
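As a rough illustration of how the Config D settings above might translate into a launch command, the sketch below assembles an argument list in Python. The flag names (`--model`, `--ctx-size`, `--cache-type-k`, `--cache-type-v`, `--n-gpu-layers`) are the ones upstream llama.cpp uses. That the TurboQuant fork accepts them unchanged is an assumption, and the binary name and model filename are hypothetical placeholders.

```python
import shlex

def config_d_command(model_path: str, binary: str = "llama-server") -> list[str]:
    """Build an argv list for Config D.

    Flag names follow upstream llama.cpp; whether the TurboQuant fork
    accepts them unchanged is an assumption, not something the post states.
    """
    return [
        binary,
        "--model", model_path,
        "--ctx-size", str(1_048_576),   # 1M-token context window
        "--cache-type-k", "q8_0",       # K cache type from Config D
        "--cache-type-v", "turbo3",     # V cache type from Config D (fork-specific)
        "--n-gpu-layers", "99",         # offload all layers to the GPUs
    ]

# Hypothetical model filename, used only for illustration.
cmd = config_d_command("Qwen3.6-35B-A3B-Q6_K.gguf")
print(shlex.join(cmd))
```

Building the command as a list rather than a single string keeps each flag and value separate, which makes it easier to swap cache types when comparing configurations.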
1M context — PASS
A needle-in-haystack run pushed 1,038,653 prompt tokens (99% of the 1M limit). The needle was recovered as an exact match in ~32 minutes with no crash or OOM. Config D is stable at true 1M context.
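For readers unfamiliar with the method, a needle-in-haystack prompt can be sketched as follows. This is a minimal illustration, not the harness used for the run above: the filler sentence, the passphrase, the insertion depth, and the helper names are all hypothetical, and a real test would size the haystack in tokens rather than repeated lines.

```python
def build_haystack(filler: str, needle: str, n_repeats: int, depth: float) -> str:
    """Repeat `filler` n_repeats times and splice `needle` in at a relative depth (0.0 to 1.0)."""
    chunks = [filler] * n_repeats
    chunks.insert(int(depth * n_repeats), needle)
    return "\n".join(chunks)

def needle_recovered(answer: str, secret: str) -> bool:
    """Exact-match check: the secret must appear verbatim in the model's answer."""
    return secret in answer

secret = "7421-ALPHA"  # hypothetical passphrase
needle = f"The secret passphrase is {secret}."
prompt = build_haystack("The sky was grey over the harbour.", needle, 1000, 0.5)
prompt += "\nWhat is the secret passphrase?"

# Fill ratio reported above: prompt tokens divided by the context limit.
fill = 1_038_653 / 1_048_576
print(f"context fill: {fill:.1%}")
```

The model's reply would then be passed to `needle_recovered` to decide pass or fail.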
Why only 10 KV layers
Qwen3.6-35B-A3B is a hybrid: of its 40 layers, only 10 are full-attention — the other 30 are Gated DeltaNet (SSM-like) with fixed recurrent state and no KV cache. That makes KV VRAM roughly 4× lower than a naive 40-layer estimate, which is what makes 1M context fit in 48 GB at all.
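The 4× figure follows directly from the layer count, as the back-of-envelope estimate below shows. The KV head count and head dimension are hypothetical placeholders, not the model's published values, and the cache is sized at 16 bits per element for simplicity, so the absolute sizes are illustrative only. The ratio between the two cases does not depend on those choices.

```python
def kv_cache_bytes(n_ctx: int, n_kv_layers: int, n_kv_heads: int,
                   head_dim: int, bits_k: float, bits_v: float) -> float:
    """Estimate KV cache size: one K and one V vector per token, per KV layer."""
    bytes_per_token_per_layer = n_kv_heads * head_dim * (bits_k + bits_v) / 8
    return n_ctx * n_kv_layers * bytes_per_token_per_layer

N_CTX = 1_048_576
N_KV_HEADS, HEAD_DIM = 2, 256   # hypothetical dimensions, for illustration only

# Naive estimate: pretend all 40 layers keep a KV cache.
naive = kv_cache_bytes(N_CTX, 40, N_KV_HEADS, HEAD_DIM, 16, 16)
# Hybrid reality: only the 10 full-attention layers keep one.
hybrid = kv_cache_bytes(N_CTX, 10, N_KV_HEADS, HEAD_DIM, 16, 16)

print(f"naive:  {naive / 2**30:.1f} GiB")
print(f"hybrid: {hybrid / 2**30:.1f} GiB")
print(f"ratio:  {naive / hybrid:.1f}x")
```

Quantised cache types reduce the bits per element further, which is where the K and V settings in Config D come in.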
Key finding
TurboQuant KV is ~38–46% slower than q4_0 KV: "turbo" means high compression, not higher speed. Use q4_0/q4_0 for maximum tokens/sec; use q8_0/turbo3 (Config D) for maximum quality and stability at 1M.
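To make the trade-off concrete, the sketch below applies the 38–46% range to a baseline throughput. It assumes "slower" means a proportional drop in tokens per second, and the baseline of 100 tokens/sec is a hypothetical round number rather than a measured result.

```python
def turbo_throughput(q4_tok_per_s: float, slowdown: float) -> float:
    """Throughput with TurboQuant KV, given a fractional slowdown relative to q4_0 KV."""
    return q4_tok_per_s * (1.0 - slowdown)

baseline = 100.0  # hypothetical q4_0/q4_0 throughput in tokens/sec

worst = turbo_throughput(baseline, 0.46)  # upper end of the reported slowdown
best = turbo_throughput(baseline, 0.38)   # lower end of the reported slowdown

print(f"q4_0 KV:       {baseline:.0f} tok/s")
print(f"TurboQuant KV: {worst:.0f} to {best:.0f} tok/s")
```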
Full benchmark matrix, scripts, and the agentic eval harness are open-source.