Live Benchmarking

Browser Agent
Test Suites

Head-to-head comparison of autonomous web agents — tested on real websites with real tasks. Success rates, costs, and failure modes, all independently verified.


Browser

LIVE

7 agents · WebVoyager benchmark


Scraping

Coming soon

Search

Coming soon


Voice AI

Coming soon


Inference

Coming soon

Efficiency vs. Cost

Success rate plotted against cost per task

[Scatter plot: success rate (0% to 100%) against cost per task ($0.01 to $0.15)]

BrowserUse v3: 94.2% / $0.042

Stagehand: 91.4% / $0.051

BrowserUse v2: 89.8% / $0.058

MultiOn: 88.1% / $0.067

Skyvern: 85.3% / $0.081

HyperBrowser: 82.7% / $0.039

Operator: 79.6% / $0.120
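One way to read these figures is to ask which agents are not beaten on both axes at once, and what each success costs on average. The sketch below uses only the numbers listed above. The helper names (`pareto_frontier`, `cost_per_success`) are illustrative and not part of the benchmark's tooling, and the cost-per-success figure assumes a failed attempt costs the same as a successful one.

```python
from typing import NamedTuple


class Agent(NamedTuple):
    name: str
    success_rate: float   # fraction of tasks completed, 0..1
    cost_per_task: float  # USD per attempted task


# Figures copied from the chart listing above.
AGENTS = [
    Agent("BrowserUse v3", 0.942, 0.042),
    Agent("Stagehand", 0.914, 0.051),
    Agent("BrowserUse v2", 0.898, 0.058),
    Agent("MultiOn", 0.881, 0.067),
    Agent("Skyvern", 0.853, 0.081),
    Agent("HyperBrowser", 0.827, 0.039),
    Agent("Operator", 0.796, 0.120),
]


def dominates(a: Agent, b: Agent) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    return (
        a.success_rate >= b.success_rate
        and a.cost_per_task <= b.cost_per_task
        and (a.success_rate > b.success_rate or a.cost_per_task < b.cost_per_task)
    )


def pareto_frontier(agents: list[Agent]) -> list[Agent]:
    """Agents that no other agent dominates, ordered from cheapest to priciest."""
    frontier = [a for a in agents if not any(dominates(b, a) for b in agents)]
    return sorted(frontier, key=lambda a: a.cost_per_task)


def cost_per_success(agent: Agent) -> float:
    """Expected spend per completed task (assumes failures cost as much as successes)."""
    return agent.cost_per_task / agent.success_rate


if __name__ == "__main__":
    for agent in pareto_frontier(AGENTS):
        print(f"{agent.name}: ${cost_per_success(agent):.4f} per successful task")
```

On these numbers only HyperBrowser (cheapest) and BrowserUse v3 (most successful) sit on the frontier; every other agent is both costlier and less successful than BrowserUse v3.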


ClawScore Overview

How deployment-ready each agent is on OpenClaw — setup friction, tool compatibility, and uptime stability.

Top ClawScore

92 BrowserUse v3

Where Agents Struggle

Auth-gated flows 44%

Multi-step checkout 61%

Dynamic dropdowns 68%

Simple navigation 93%

Avg. success rate across all agents by task type

Benchmark Suites


WebVoyager

He et al., 2024

ACTIVE

The most widely used benchmark for web agents. 643 tasks across 15 real websites (Google Flights, Amazon, GitHub, etc.) covering navigation, form filling, data extraction, and multi-step workflows.

Tasks

643

Sites

15

Runs


Mind2Web

Deng et al., 2023

COMING SOON

2,000+ tasks across 137 websites with annotated action sequences. Tests generalization to unseen sites and complex, multi-step interactions — a harder test of real-world readiness.

Tasks

2,000+

Sites

137

Status

Q3 2025

WebVoyager Results

All agents are tested on identical task sets with deterministic seeds. Only BrowserUse v3 has a published report so far; reports for the other agents will be released as they are finalized.

Agent	Description	Success Rate	Cost / Task	Latency	Error Recovery	ClawScore
BrowserUse v3	GPT-4o backbone	94.2%	$0.042	1.2s	87%	92
Stagehand	Browserbase	91.4%	$0.051	1.8s	81%	88
BrowserUse v2	Claude 3.5 Sonnet	89.8%	$0.058	2.4s	72%	84
MultiOn Agent	MultiOn (hosted)	88.1%	$0.067	3.1s	76%	81
Skyvern	Vision-first agent	85.3%	$0.081	4.6s	69%	74
HyperBrowser	Cloud browser infra	82.7%	$0.039	2.8s	64%	71
Operator	OpenAI CUA model	79.6%	$0.120	5.2s	58%	63

How We Test


Reproducible Runs

Every agent runs the same tasks with pinned seeds and configs. 5 independent runs per agent to measure variance. All configs are open-source — fork and verify.
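As a rough illustration of how variance across the 5 runs could be summarized, the sketch below computes the mean and sample standard deviation of per-run success rates. The function name `summarize_runs` and the example values are hypothetical; they are not taken from the published results or the open-source configs.

```python
from statistics import mean, stdev


def summarize_runs(success_rates: list[float]) -> dict[str, float]:
    """Summarize per-run success rates (fractions in 0..1) for one agent."""
    if len(success_rates) < 2:
        raise ValueError("need at least two runs to estimate variance")
    return {
        "mean": mean(success_rates),
        "stdev": stdev(success_rates),  # sample standard deviation (n - 1)
        "min": min(success_rates),
        "max": max(success_rates),
    }


# Hypothetical success rates from 5 independent runs of one agent.
EXAMPLE_RUNS = [0.90, 0.92, 0.91, 0.89, 0.93]

if __name__ == "__main__":
    summary = summarize_runs(EXAMPLE_RUNS)
    print(f"mean={summary['mean']:.3f} stdev={summary['stdev']:.4f}")
```

A small standard deviation relative to the gap between two agents suggests their ranking is stable across runs; a large one suggests the ordering could flip on a rerun.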


Automated Grading

Task success is evaluated by comparing final page state against ground-truth criteria — not self-reported by the agent. Human spot-checks validate edge cases.
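A minimal sketch of what state-based grading can look like, assuming the final page state is captured as a simple key-value snapshot. The `Criterion` type, the `grade` function, the state keys (`url`, `result_count`), and the example task are all hypothetical; the actual harness may represent page state and criteria quite differently.

```python
from dataclasses import dataclass
from typing import Any, Callable, Mapping, Sequence

# Hypothetical snapshot of the page after the agent stops, e.g. URL and key fields.
PageState = Mapping[str, Any]


@dataclass(frozen=True)
class Criterion:
    description: str
    check: Callable[[PageState], bool]


def grade(final_state: PageState, criteria: Sequence[Criterion]) -> tuple[bool, list[str]]:
    """Pass only if every ground-truth criterion holds; also report which ones failed."""
    failed = [c.description for c in criteria if not c.check(final_state)]
    return len(failed) == 0, failed


# Hypothetical ground-truth criteria for a "search for flights" task.
FLIGHT_SEARCH_CRITERIA = [
    Criterion(
        "landed on a flights results page",
        lambda s: "/flights" in str(s.get("url", "")),
    ),
    Criterion(
        "at least one result is listed",
        lambda s: int(s.get("result_count", 0)) > 0,
    ),
]

if __name__ == "__main__":
    state = {"url": "https://example.com/flights?from=SFO", "result_count": 12}
    print(grade(state, FLIGHT_SEARCH_CRITERIA))
```

The point of the design is that the verdict depends only on the observed end state, so an agent that claims success without reaching it still fails.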


Neutral & Independent

We don't accept payment for rankings or report placement. Agents are tested on equal footing — same hardware, same network conditions, same evaluation criteria.

Read the full methodology

Want your agent benchmarked?

We'll run it through our eval harness and publish a full report — same methodology, same standards as every other agent on the platform.

Request a Benchmark · View Methodology
