7 links tagged with all of: language-models + benchmarks
Links
Frontier language models demonstrate the ability to recognize when they are being evaluated, with a significant but not superhuman level of evaluation awareness. This capability raises concerns about the reliability of assessments and benchmarks, as models may behave differently during evaluations. The study includes a benchmark of 1,000 prompts from various datasets and finds that while models outperform random chance in identifying evaluations, they still lag behind human performance.
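A rough sketch of how such an evaluation-awareness probe might be scored: show the model a transcript, ask whether it comes from an evaluation, and compare against human labels. The prompt wording, the chat() helper, and the label format below are assumptions for illustration, not the paper's actual protocol.

```python
# Hedged sketch: measure how well a model can tell evaluation transcripts
# from deployment-style transcripts. `chat()` is a placeholder for any
# chat-completion call; labels are 1 = evaluation, 0 = deployment.
from typing import Callable, List, Tuple

QUESTION = (
    "Does the transcript above come from an evaluation/benchmark of an AI "
    "system, or from a real deployment interaction? Answer EVAL or REAL."
)

def evaluation_awareness_accuracy(
    examples: List[Tuple[str, int]],   # (transcript, label) pairs
    chat: Callable[[str], str],        # model call, returns a string
) -> float:
    correct = 0
    for transcript, label in examples:
        answer = chat(f"{transcript}\n\n{QUESTION}").strip().upper()
        prediction = 1 if answer.startswith("EVAL") else 0
        correct += int(prediction == label)
    return correct / len(examples)

# Chance is ~0.5 on a balanced set; the reported finding is that frontier
# models land well above chance but below careful human annotators.
```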
HELMET (How to Evaluate Long-Context Models Effectively and Thoroughly) is introduced as a comprehensive benchmark for evaluating long-context language models (LCLMs), addressing limitations in existing evaluation methods. The blog outlines HELMET's design, key findings from evaluations of 59 recent LCLMs, and offers a quickstart guide for practitioners to utilize HELMET in their research and applications.
Recursive Language Models (RLMs) are introduced as a novel inference strategy allowing language models to recursively interact with unbounded input context through REPL environments. This approach aims to mitigate the context rot phenomenon and improve performance on long-context benchmarks, showing promising early results that suggest RLMs may enhance general-purpose inference capabilities.
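The core idea, as described, is that the root model never ingests the full context directly; it treats the context as data it can slice and hand to fresh model calls, recursing until each piece fits. A minimal conceptual sketch follows, with llm() and the fixed binary split as simplifying assumptions (in the actual RLM setup the model itself decides how to manipulate the context via a REPL):

```python
# Hedged sketch of the recursive pattern: the long context lives in a
# variable, and the model only ever sees bounded chunks, recursing as needed.
# `llm(prompt)` stands in for any bounded-context completion call.
from typing import Callable

def rlm_answer(question: str, context: str, llm: Callable[[str], str],
               chunk_chars: int = 8000) -> str:
    if len(context) <= chunk_chars:
        # Base case: the context fits, so ask the model directly.
        return llm(f"Context:\n{context}\n\nQuestion: {question}")
    # Recursive case: split the context, answer each half with a fresh
    # call, then combine the partial answers at the root.
    mid = len(context) // 2
    left = rlm_answer(question, context[:mid], llm, chunk_chars)
    right = rlm_answer(question, context[mid:], llm, chunk_chars)
    return llm(
        "Combine these two partial answers into one final answer.\n"
        f"Question: {question}\nPartial A: {left}\nPartial B: {right}"
    )
```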
ThinkMesh is a Python library designed for executing various reasoning strategies in parallel using language models, particularly leveraging the Qwen2.5-7B-Instruct model. It supports multiple reasoning approaches such as DeepConf, Self-Consistency, and Debate, catering to a range of problem types from mathematical proofs to planning tasks. The library also includes performance monitoring and benchmarking features to ensure effective usage and integration with different backends.
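For orientation, here is what one of the listed strategies, self-consistency, looks like in generic form: sample several reasoning traces in parallel and majority-vote their final answers. This is a sketch against an assumed generate() sampler, not ThinkMesh's actual API.

```python
# Hedged sketch of self-consistency (one of the strategies ThinkMesh lists):
# sample several reasoning traces in parallel and majority-vote the answers.
# `generate(prompt, temperature)` is a stand-in, not ThinkMesh's interface.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def self_consistency(prompt: str,
                     generate: Callable[[str, float], str],
                     extract_answer: Callable[[str], str],
                     n_samples: int = 8,
                     temperature: float = 0.8) -> str:
    # Run the samples concurrently, since each trace is independent.
    with ThreadPoolExecutor(max_workers=n_samples) as pool:
        traces = list(pool.map(lambda _: generate(prompt, temperature),
                               range(n_samples)))
    answers = [extract_answer(t) for t in traces]
    # Return the most common final answer across the sampled traces.
    return Counter(answers).most_common(1)[0][0]
```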
OLMo 2 is a family of fully-open language models designed for accessibility and reproducibility in AI research. The largest model, OLMo 2 32B, surpasses GPT-3.5-Turbo and GPT-4o mini on various academic benchmarks, while the smaller models (7B, 13B, and 1B) are competitive with other open-weight models. Ai2 emphasizes the importance of open training data and code to advance collective scientific research.
AI Diplomacy reimagines the classic game Diplomacy by having a dozen large language models compete for dominance in a simulated 1901 Europe. The experiment aims to evaluate the negotiation strategies and behaviors of these AIs, revealing insights into their trustworthiness and capabilities. Viewers can watch the AIs interact in real-time through a live Twitch stream.
Researchers at Ai2 propose a method for evaluating language models by measuring the signal-to-noise ratio (SNR) of benchmarks. They demonstrate that higher SNR in benchmarks leads to more reliable model evaluations and suggest interventions to enhance benchmark quality, ultimately improving decision-making in language model training and scaling predictions. A dataset of 900K evaluation results on 465 models is also released to support further research in evaluation methodologies.
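A back-of-the-envelope version of the metric: treat the spread of scores across distinct models as signal and the score wobble across nearby checkpoints of a single run as noise. The estimators below are illustrative assumptions, not necessarily the paper's exact definitions.

```python
# Hedged sketch of a benchmark signal-to-noise ratio: signal is how much the
# benchmark separates different models, noise is how much one model's score
# fluctuates across late-training checkpoints.
import statistics
from typing import Dict, List

def benchmark_snr(scores_by_model: Dict[str, float],
                  checkpoint_scores: List[float]) -> float:
    signal = statistics.pstdev(scores_by_model.values())  # across models
    noise = statistics.pstdev(checkpoint_scores)           # across checkpoints
    return signal / noise if noise > 0 else float("inf")

# Example: a wide spread across models with only a small checkpoint wobble
# yields a high SNR, i.e. a benchmark you can trust for model comparisons.
print(benchmark_snr({"a": 61.0, "b": 55.2, "c": 64.8, "d": 58.1},
                    [60.7, 61.3, 60.9, 61.5]))
```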