Quit Emailing Yourself

# ai-research → benchmarks

2 links tagged with all of: ai-research + benchmarks

Click any tag below to further narrow down your results

Links

Measuring Reward Hacking in LLMs

This article discusses "ImpossibleBench," a framework designed to assess how well language models (LLMs) follow task specifications without exploiting test cases. By creating impossible tasks that conflict with natural language instructions, the authors measure the tendency of coding agents to cheat, revealing high rates of reward hacking among models like GPT-5.

Saved by tldr-importer · Last saved February 14, 2026 · 8 min read

+ reward-hacking + language-models + coding benchmarks ✓ ai-research ✓

OLMo from Ai2

OLMo 2 is a family of fully-open language models designed for accessibility and reproducibility in AI research. The largest model, OLMo 2 32B, surpasses GPT-3.5-Turbo and GPT-4o mini on various academic benchmarks, while the smaller models (7B, 13B, and 1B) are competitive with other open-weight models. Ai2 emphasizes the importance of open training data and code to advance collective scientific research.

Saved by tldr-importer · Last saved October 29, 2025 · 2 min read

+ olmo + language-models + open-source ai-research ✓ benchmarks ✓