| Benchmark | Description | Domain | Tasks | Best score |
|---|---|---|---|---|
| WebVoyager | End-to-end web agent evaluation across 15 real websites, using GPT-4V as judge | Browser & web | 643 | 97.1% |
| WebArena | Realistic self-hosted web environment spanning e-commerce, forums, and productivity tools | Browser & web | 812 | 61.7% |
| Mind2Web | Generalist web agent benchmark across diverse real-world tasks and websites | Browser & web | 2,350 | |
| ToolBench | Instruction-following benchmark for LLMs using 16,000+ real-world APIs | Tool use | 3,451 | |
| τ-bench | Dynamic tool-agent-user conversation benchmark in retail and airline domains | Tool use | 1,190 | |
| SWE-bench Verified | Human-validated real GitHub issues requiring code changes across Python repositories | Coding | 500 | 79.2% |
| HumanEval | Hand-written Python programming problems evaluating functional correctness; near-saturated | Coding | 164 | ~99% |
| GAIA | Real-world questions requiring multi-step reasoning, web browsing, and tool use | Reasoning | 466 | |
| AgentBench | Multi-environment evaluation of LLMs as agents across OS, databases, web, and games | Reasoning | 1,365 | |
| OSWorld | Desktop task benchmark across real apps on Ubuntu, Windows, and macOS; SOTA exceeds the human baseline | Multimodal | 369 | 76.3% |
| ScreenSpot | GUI grounding benchmark testing visual element location across platforms and app types | Multimodal | 1,272 | |
**Custom benchmarks** (private or public): domain-specific environments built on your production workflows, kept private to your team or published as an industry standard. Your call.
**Need a benchmark that doesn't exist yet?**

We build benchmarks, private or public, for any industry where agents are being deployed. Request one or become a data partner.

Get in touch