The study evaluates the capabilities of autonomous web agents built on large language models, finding that flaws in current benchmarks have created a gap between the perceived and actual competence of these agents. It introduces Online-Mind2Web, a new evaluation benchmark comprising 300 tasks across 136 websites, and presents an LLM-as-a-Judge method whose verdicts align closely with human assessment. The findings clarify the strengths and limitations of existing web agents and point to directions for future research.
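The LLM-as-a-Judge idea can be sketched as follows: a judge model reads the task, the agent's action log, and the final page state, then issues a success/failure verdict. This is a minimal illustration only, assuming an OpenAI-style chat API; the `Trajectory` fields, prompt wording, and model choice are hypothetical and are not the paper's actual judging implementation.

```python
# Minimal sketch of an LLM-as-a-Judge loop for web-agent trajectories.
# Illustrative assumptions: the official OpenAI Python SDK, a textual
# action log, and this particular prompt -- none taken from the paper.
from dataclasses import dataclass

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


@dataclass
class Trajectory:
    task: str            # natural-language task description
    actions: list[str]   # textual log of the agent's actions
    final_state: str     # description or extract of the final page


JUDGE_PROMPT = """You are evaluating a web agent.
Task: {task}
Actions taken:
{actions}
Final page state:
{final_state}

Did the agent fully complete the task? Answer SUCCESS or FAILURE,
then give a one-sentence justification."""


def judge(traj: Trajectory, model: str = "gpt-4o") -> bool:
    """Return True if the judge model labels the trajectory a success."""
    prompt = JUDGE_PROMPT.format(
        task=traj.task,
        actions="\n".join(f"{i + 1}. {a}" for i, a in enumerate(traj.actions)),
        final_state=traj.final_state,
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic judging for reproducibility
    )
    verdict = resp.choices[0].message.content or ""
    return verdict.strip().upper().startswith("SUCCESS")
```

In practice such a judge would be validated against human labels on a held-out set of trajectories before being trusted for benchmark scoring, which is the kind of human-alignment check the study reports.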
Tags: web-agents, evaluation, benchmarks, artificial-intelligence, automation