Quit Emailing Yourself

# evaluation → web-agents

2 links tagged with all of: evaluation + web-agents

Links

WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks

This article introduces WebGym, a large open-source environment for training visual web agents on nearly 300,000 tasks drawn from real websites. It details a reinforcement learning approach that improves agent performance, yielding higher success rates on unseen tasks than the other models it is compared against.

Saved by tldr-importer · Last saved February 14, 2026 · 2 min read

Tags: web-agents, reinforcement-learning, machine-learning, tasks, evaluation

An Illusion of Progress? Assessing the Current State of Web Agents

The study evaluates autonomous web agents built on large language models and finds that flaws in current benchmarks make the agents appear more capable than they are. It introduces Online-Mind2Web, a new evaluation benchmark of 300 tasks across 136 websites, and presents an LLM-as-a-Judge method that aligns closely with human assessment. The findings highlight the strengths and limitations of existing web agents to guide future research.

Saved by tldr-importer · Last saved October 29, 2025 · 2 min read

Tags: web-agents, evaluation, benchmarks, artificial-intelligence, automation