Research
Original thinking from building AI that runs in production.
Qwen3.6-35B-A3B at 1M context on two RTX 3090s
A seven-config benchmark, a true 1M-token needle test (PASS), and a 7/8 agentic eval — local, open, reproducible.
Grounding agents without hallucination drift
A practical retrieval architecture that keeps agents factual over long sessions.
The real cost of LLM eval at scale
What we learned running 2M evaluations across production systems.
Human-in-the-loop that actually scales
Designing review workflows that don't become the bottleneck.