Browser Agent
Test Suites
Head-to-head comparison of autonomous web agents — tested on real websites with real tasks. Success rates, costs, and failure modes, all independently verified.
Browser
LIVE · 7 agents · WebVoyager benchmark
Scraping
Coming soon
Search
Coming soon
Voice AI
Coming soon
Inference
Coming soon
Efficiency vs. Cost
Success rate plotted against cost per task
ClawScore Overview
How deployment-ready each agent is on OpenClaw — setup friction, tool compatibility, and uptime stability.
Top ClawScore
Where Agents Struggle
Avg. success rate across all agents by task type
Benchmark Suites
WebVoyager
He et al., 2024
The most widely used benchmark for web agents. 643 tasks across 15 real websites (Google Flights, Amazon, GitHub, etc.) covering navigation, form filling, data extraction, and multi-step workflows.
Tasks
643
Sites
15
Runs
5
Mind2Web
Deng et al., 2023
2,000+ tasks across 137 websites with annotated action sequences. Tests generalization to unseen sites and complex, multi-step interactions — a harder test of real-world readiness.
Tasks
2,000+
Sites
137
Status
Q3 2025
WebVoyager Results
All agents are tested on identical task sets with deterministic seeds. Only BrowserUse v3 has a published report so far; results for the other agents will be released as their reports are finalized.
| Agent | Success Rate | Cost / Task | Latency | Error Recovery | ClawScore |
|---|---|---|---|---|---|
| BrowserUse v3 (GPT-4o backbone) | 94.2% | $0.042 | 1.2s | 87% | 92 |
| Stagehand (Browserbase) | 91.4% | $0.051 | 1.8s | 81% | 88 |
| BrowserUse v2 (Claude 3.5 Sonnet) | 89.8% | $0.058 | 2.4s | 72% | 84 |
| MultiOn Agent (MultiOn, hosted) | 88.1% | $0.067 | 3.1s | 76% | 81 |
| Skyvern (Vision-first agent) | 85.3% | $0.081 | 4.6s | 69% | 74 |
| HyperBrowser (Cloud browser infra) | 82.7% | $0.039 | 2.8s | 64% | 71 |
| Operator (OpenAI CUA model) | 79.6% | $0.120 | 5.2s | 58% | 63 |
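One way to read success rate and cost together is the expected spend per successful task, which is cost per task divided by success rate. The sketch below applies that to the figures in the results table. It is an illustrative derived metric, not part of the published methodology, and it assumes a failed attempt costs the same as a successful one. The names `RESULTS` and `cost_per_success` are made up for this example.

```python
# (success rate, cost per task in USD), copied from the WebVoyager results table.
RESULTS = {
    "BrowserUse v3": (0.942, 0.042),
    "Stagehand": (0.914, 0.051),
    "BrowserUse v2": (0.898, 0.058),
    "MultiOn Agent": (0.881, 0.067),
    "Skyvern": (0.853, 0.081),
    "HyperBrowser": (0.827, 0.039),
    "Operator": (0.796, 0.120),
}


def cost_per_success(success_rate, cost_per_task):
    """Expected spend to get one successful task.

    Assumption: every attempt costs the same whether it passes or fails.
    """
    return cost_per_task / success_rate


# Cheapest per successful task first.
ranked = sorted(RESULTS, key=lambda name: cost_per_success(*RESULTS[name]))

for name in ranked:
    print(f"{name:15s} ${cost_per_success(*RESULTS[name]):.4f} per successful task")
```

On this reading, HyperBrowser's low per-task cost places it second despite its lower success rate.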
How We Test
Reproducible Runs
Every agent runs the same tasks with pinned seeds and configs. Each agent gets 5 independent runs so that run-to-run variance can be measured. All configs are open-source, so anyone can fork them and verify the results.
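As a rough illustration of pinned seeds and repeated runs, the sketch below simulates five seeded runs and reports the mean and standard deviation of the success rate. It is not the actual harness: `run_suite`, the seed values, and the 0.9 pass probability are all hypothetical stand-ins for launching a real agent.

```python
import random
import statistics

SEEDS = [0, 1, 2, 3, 4]  # hypothetical pinned seeds, one per run


def run_suite(seed, n_tasks=643, p_success=0.9):
    """Stand-in for one benchmark run.

    A real harness would drive the agent here; this only simulates a
    pass/fail outcome per task using a seeded random generator.
    """
    rng = random.Random(seed)
    passed = sum(rng.random() < p_success for _ in range(n_tasks))
    return passed / n_tasks


rates = [run_suite(seed) for seed in SEEDS]
print(f"mean={statistics.mean(rates):.3f} stdev={statistics.stdev(rates):.3f}")

# Because the seeds are pinned, repeating the experiment gives identical rates.
assert rates == [run_suite(seed) for seed in SEEDS]
```

Reporting the spread alongside the mean shows whether a gap between two agents is larger than the run-to-run noise.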
Automated Grading
Task success is evaluated by comparing final page state against ground-truth criteria — not self-reported by the agent. Human spot-checks validate edge cases.
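A minimal sketch of state-based grading might look like the following. The `PageState` and `Criteria` shapes, the URL-prefix check, and the phrase matching are assumptions made for illustration; the real grader's criteria are not described here.

```python
from dataclasses import dataclass


@dataclass
class PageState:
    """Hypothetical snapshot of the browser when the agent stops."""
    url: str
    text: str


@dataclass
class Criteria:
    """Hypothetical ground truth for a single task."""
    url_prefix: str
    required_phrases: list


def grade(state, criteria):
    """Return True only if the final page meets every criterion.

    The agent's own claim of success is never consulted.
    """
    if not state.url.startswith(criteria.url_prefix):
        return False
    page = state.text.lower()
    return all(phrase.lower() in page for phrase in criteria.required_phrases)


criteria = Criteria("https://example.com/orders/", ["order confirmed", "2 items"])
ok = PageState("https://example.com/orders/123", "Order confirmed. 2 items shipped.")
bad = PageState("https://example.com/cart", "Your cart has 2 items.")

print(grade(ok, criteria), grade(bad, criteria))
```

Checks this simple will misjudge some tasks, which is where the human spot-checks on edge cases come in.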
Neutral & Independent
We don't accept payment for rankings or report placement. Agents are tested on equal footing — same hardware, same network conditions, same evaluation criteria.
Want your agent benchmarked?
We'll run it through our eval harness and publish a full report — same methodology, same standards as every other agent on the platform.