Coming 2026
Voice agent benchmark: the first public evaluation
A large-scale evaluation of voice AI agents across real customer service workflows, scored against live call outcomes rather than simulated transcripts. More details to follow.
Results locked
Coming soon
Why agent benchmarks need production data
Most benchmarks are built from synthetic tasks. We argue that the only benchmark worth running is one grounded in what your agents actually do, and in what it costs when they get it wrong.
Methodology
Coming soon
Jan 2026
Browser agent evaluation: 643 tasks, three scoring methods
A completed large-scale run across 643 web navigation tasks, each evaluated under three distinct scoring methodologies. Results are available to design partners on request.
Results
Request access →
More findings on the way.
We publish selectively, only when the results are worth publishing. Leave your email and we'll reach out when new research is available.