1 link tagged with all of: ai + evaluation + benchmark + machine-learning + agents
AIRS-Bench evaluates the research capabilities of large language model agents across 20 machine-learning tasks. Each task specifies a problem, a dataset, a metric, and a state-of-the-art reference value, so different agent configurations can be compared against a known baseline. The framework accepts contributions from the AI research community for further development.