Research

Original thinking from building AI that runs in production.

A seven-config benchmark, a true 1M-token needle test (PASS), and a 7/8 agentic eval — local, open, reproducible.

A practical retrieval architecture that keeps agents factual over long sessions.

What we learned running 2M evaluations across production systems.

Designing review workflows that don't become the bottleneck.