Web Bench is a new dataset for evaluating AI browser agents, consisting of 5,750 tasks across 452 websites. It aims to address limitations in existing benchmarks by covering both read and write tasks. The results show that agents struggle significantly with write-heavy tasks such as form filling and authentication, while performing better on read tasks. Skyvern 2.0 currently leads on write tasks, but the results overall highlight room for improvement in AI browser agents.
web-benchmark
ai-agents
performance-evaluation
dataset
browser-automation