5 links tagged with all of: evaluation + benchmarks
Links
HELMET (How to Evaluate Long-Context Models Effectively and Thoroughly) is introduced as a comprehensive benchmark for evaluating long-context language models (LCLMs), addressing limitations in existing evaluation methods. The blog post outlines HELMET's design, presents key findings from evaluations of 59 recent LCLMs, and offers a quickstart guide for practitioners who want to use HELMET in their research and applications.
Humanity's Last Exam (HLE), an AI benchmark built from PhD-level research questions, has been criticized because a significant share of its biology and chemistry questions (29 ± 3.7%) contradict peer-reviewed literature. An independent follow-up found 18% of a sampled subset problematic, prompting the HLE team to start a rolling revision process to improve the evaluation. The design of the original review process may have let confusing or incorrect questions through that do not reflect established scientific knowledge.
ScreenSuite is introduced as the most comprehensive evaluation suite for GUI agents, designed to benchmark vision language models (VLMs) across various capabilities such as perception, grounding, and multi-step actions. It provides a modular and vision-only framework for evaluating GUI agents in realistic scenarios, allowing for easier integration and reproducibility in AI research.
The study evaluates the capabilities of autonomous web agents based on large language models, revealing a disparity between perceived and actual competencies due to flaws in current benchmarks. It introduces Online-Mind2Web, a new evaluation benchmark comprising 300 tasks across 136 websites, and presents a novel LLM-as-a-Judge method that aligns closely with human assessment. The findings highlight the strengths and limitations of existing web agents to guide future research directions.
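The LLM-as-a-Judge idea can be illustrated with a minimal sketch: an LLM is shown the task description and the agent's recorded actions and asked to decide whether the task was completed. The model name, prompt wording, and the judge_success helper below are assumptions for illustration, not the paper's actual implementation.

```python
# Minimal sketch of an LLM-as-a-Judge check for web-agent runs.
# The model name, prompt, and judge_success helper are illustrative
# assumptions, not the Online-Mind2Web implementation.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def judge_success(task: str, action_log: list[str], model: str = "gpt-4o") -> bool:
    """Ask an LLM whether the agent's recorded actions completed the task."""
    transcript = "\n".join(f"{i + 1}. {a}" for i, a in enumerate(action_log))
    prompt = (
        "You are evaluating an autonomous web agent.\n"
        f"Task: {task}\n"
        f"Agent actions:\n{transcript}\n\n"
        "Did the agent complete the task? Answer YES or NO."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")


# Example: judge a short recorded run against a simple task.
print(judge_success(
    "Find the price of a one-way train ticket from Boston to New York",
    ["open travel site", "search Boston -> New York", "read fare: $49"],
))
```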
Researchers at Ai2 propose a method for evaluating language models by measuring the signal-to-noise ratio (SNR) of benchmarks. They demonstrate that higher SNR in benchmarks leads to more reliable model evaluations and suggest interventions to enhance benchmark quality, ultimately improving decision-making in language model training and scaling predictions. A dataset of 900K evaluation results on 465 models is also released to support further research in evaluation methodologies.
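As a rough illustration of the metric, the sketch below computes an SNR per benchmark, assuming signal is taken as the spread of final scores across models and noise as the checkpoint-to-checkpoint variation within a single training run; the exact definitions in the Ai2 paper may differ.

```python
# Rough sketch of a benchmark signal-to-noise ratio, assuming:
#   signal = spread of final scores across different models
#   noise  = std of scores across the last few checkpoints of one run.
# The Ai2 paper's exact definitions may differ; this is illustrative only.
import numpy as np


def benchmark_snr(final_scores_by_model: np.ndarray,
                  last_checkpoint_scores: np.ndarray) -> float:
    """Higher SNR = models are easier to tell apart relative to run-to-run jitter."""
    signal = final_scores_by_model.max() - final_scores_by_model.min()
    noise = last_checkpoint_scores.std(ddof=1)
    return signal / noise


# Example: ten models' final scores vs. five late checkpoints of one model.
model_scores = np.array([0.42, 0.45, 0.47, 0.51, 0.53, 0.55, 0.58, 0.60, 0.63, 0.66])
checkpoint_scores = np.array([0.548, 0.552, 0.545, 0.555, 0.550])
print(f"SNR: {benchmark_snr(model_scores, checkpoint_scores):.1f}")
```

A benchmark with a low SNR in this sense cannot reliably distinguish models, since its score differences are swamped by ordinary training noise.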