Coming 2026
Voice agent benchmark:
the first public evaluation
A large-scale evaluation of voice AI agents across real customer service workflows — scored against live call outcomes, not simulated transcripts. More details to follow.
Forthcoming
Results locked
Coming soon
Why agent benchmarks
need production data
Most agent benchmarks are built from synthetic tasks. We argue that the only benchmark worth running is one grounded in what your agents actually do in production — and what it costs when they get it wrong.
Methodology
Coming soon
Jan 2026
Browser agent evaluation:
643 tasks, three scoring methods
A completed large-scale run across 643 web navigation tasks, evaluated under three different scoring methodologies. Results available to design partners on request.

More findings on the way.

We publish selectively, only when the results are worth publishing. Leave your email and we'll reach out when new research is available.

Run a benchmark.
Get real findings.

We work with enterprise teams and AI-native companies deploying agents in production.

Get in touch →