Links
AIRS-Bench evaluates the research capabilities of large language model agents across 20 machine-learning tasks. Each task specifies a problem, a dataset, an evaluation metric, and a state-of-the-art score, so different agent configurations can be compared against a common reference point. The framework is open to contributions from the AI research community for further development.
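To make the task structure concrete, here is a minimal Python sketch of what such a task specification might look like; the names (`TaskSpec`, `relative_to_sota`) are hypothetical illustrations, not AIRS-Bench's actual schema or API.

```python
from dataclasses import dataclass

@dataclass
class TaskSpec:
    """Illustrative shape of one benchmark task: a problem statement,
    a dataset reference, an evaluation metric, and the published
    state-of-the-art score that agent results are compared against."""
    problem: str       # natural-language description of the research task
    dataset: str       # dataset identifier or path
    metric: str        # e.g. "accuracy", "F1", "RMSE"
    sota_score: float  # best known score on this task

def relative_to_sota(agent_score: float, task: TaskSpec,
                     higher_is_better: bool = True) -> float:
    """Fraction of the state-of-the-art score an agent achieves,
    as one simple way to compare agent configurations across tasks."""
    if higher_is_better:
        return agent_score / task.sota_score
    return task.sota_score / agent_score

# Example: an agent scoring 0.81 on a task whose SOTA is 0.90
# reaches 90% of the state of the art.
task = TaskSpec("classify X", "dataset-v1", "accuracy", 0.90)
print(relative_to_sota(0.81, task))  # 0.9
```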