| Benchmark | Description | Domain | Tasks | Best score |
|---|---|---|---|---|
| WebVoyager | End-to-end web agent evaluation across 15 real websites, using GPT-4V as judge | Browser & web | 643 | 97.1% |
| WebArena | Realistic self-hosted web environment spanning e-commerce, forums, and productivity tools | Browser & web | 812 | 61.7% |
| Mind2Web | Generalist web agent benchmark across diverse real-world tasks and websites | Browser & web | 2,350 | |
| ToolBench | Instruction-following benchmark for LLMs using 16,000+ real-world APIs | Tool use | 3,451 | |
| τ-bench | Dynamic tool-agent-user conversation benchmark in retail and airline domains | Tool use | 1,190 | |
| SWE-bench Verified | Human-validated real GitHub issues requiring code changes across Python repositories | Coding | 500 | 79.2% |
| HumanEval | Hand-written Python programming problems evaluating functional correctness; near-saturated | Coding | 164 | ~99% |
| GAIA | Real-world questions requiring multi-step reasoning, web browsing, and tool use | Reasoning | 466 | |
| AgentBench | Multi-environment evaluation of LLMs as agents across OS, databases, web, and games | Reasoning | 1,365 | |
| OSWorld | Desktop task benchmark across real apps on Ubuntu, Windows, and macOS; SOTA exceeds the human baseline | Multimodal | 369 | 76.3% |
| ScreenSpot | GUI grounding benchmark testing visual element location across platforms and app types | Multimodal | 1,272 | |
**Custom benchmarks** (private or public): domain-specific environments built on your production workflows, kept private to your team or published as an industry standard. Your call.
**Need a benchmark that doesn't exist yet?**

We build benchmarks, private or public, for any industry where agents are being deployed. Request one or become a data partner.

Get in touch