Coming 2026
Voice agent benchmark: the first public evaluation
A large-scale evaluation of voice AI agents across real customer service workflows, scored against live call outcomes rather than simulated transcripts. More details to follow.
Results locked
Coming soon
Why agent benchmarks need production data
Most benchmarks are built from synthetic tasks. We argue that the only benchmark worth running is one grounded in what your agents actually do, and in what it costs when they get it wrong.
Methodology
Coming soon
Jan 2026
Browser agent evaluation: 643 tasks, three scoring methods
A completed large-scale run across 643 web navigation tasks, each evaluated under three distinct scoring methodologies. Results are available to design partners on request.
Results
Request access →
More findings on the way.
We publish selectively, only when the results are worth publishing. Leave your email and we'll reach out when new research is available.