5 min read | Saved February 14, 2026
Do you care about this?
AIRS-Bench evaluates the research capabilities of large language model agents across 20 machine learning tasks. Each task pairs a problem, a dataset, an evaluation metric, and a state-of-the-art value, so different agent configurations can be compared against human results. The framework invites contributions from the AI research community for further development.
If you do, here's more
The AI Research Science Benchmark (AIRS-Bench) evaluates the autonomous research capabilities of large language model (LLM) agents in machine learning. It consists of 20 tasks drawn from notable machine learning papers, covering areas like natural language processing (NLP), code generation, mathematics, and time series forecasting. Each task is defined by a triplet: a problem (e.g., text similarity), a dataset (e.g., SICK), and a performance metric (e.g., Spearman Correlation), alongside a state-of-the-art (SOTA) value set by human researchers.
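The triplet structure described above can be sketched as a simple data record. This is a hypothetical illustration only; the field names and the SOTA value are assumptions, not AIRS-Bench's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """One AIRS-Bench-style task: a problem, a dataset, a metric, and a human SOTA."""
    problem: str   # e.g. "text similarity"
    dataset: str   # e.g. "SICK"
    metric: str    # e.g. "Spearman Correlation"
    sota: float    # state-of-the-art value set by human researchers


# Example built from the task described in the text (the SOTA value is made up)
sick = Task(problem="text similarity", dataset="SICK",
            metric="Spearman Correlation", sota=0.90)
print(sick.metric)  # -> Spearman Correlation
```

Freezing the dataclass keeps each task specification immutable once defined, which suits a benchmark where the problem, dataset, metric, and SOTA are fixed reference points.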
The AIRS-Bench framework assesses agents, each a combination of an LLM and a scaffold (a mechanism that helps the LLM navigate the solution space). Evaluation runs in both linear and parallel harnesses, with scaffolds such as ReAct and search strategies such as One-shot and Greedy. Notably, effectiveness varies widely: some agents consistently surpass human benchmarks, while others fail to produce viable solutions.
The benchmark results reveal a range of normalized scores for different agents, with the Greedy gpt-oss-120b model achieving an average score of 0.522. The performance varies across tasks, indicating differing levels of difficulty. Each task's specifications are well-documented, including the necessary metadata and scripts to facilitate agent training and evaluation. AIRS-Bench encourages contributions from the AI research community, particularly those that leverage open-source components, fostering an environment for collaborative improvement and exploration of agentic AI capabilities.
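One way to read an average normalized score like 0.522 is as a mean of per-task scores scaled against the human SOTA. The ratio-based normalization below is an assumption for illustration; the text does not state AIRS-Bench's actual formula, and the per-task numbers are made up.

```python
def normalized_score(agent_score: float, sota: float) -> float:
    """Normalize an agent's raw metric value against the human SOTA.

    Assumes a higher-is-better metric; the actual AIRS-Bench
    normalization may differ (e.g. it may also subtract a baseline).
    """
    return agent_score / sota


# Illustrative (made-up) per-task (agent_score, sota) pairs
results = [(0.45, 0.90), (0.30, 0.50), (0.12, 0.25)]

# Average the per-task normalized scores into a single benchmark number
avg = sum(normalized_score(s, sota) for s, sota in results) / len(results)
print(round(avg, 3))  # -> 0.527
```

Averaging normalized rather than raw scores keeps tasks with very different metric scales (e.g. correlation vs. accuracy) from dominating the aggregate.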