Quit Emailing Yourself

# ai-research → coding

2 links tagged with all of: ai-research + coding

Click any tag below to further narrow down your results

Links

Introducing cline-bench: A Real-World, Open Source Benchmark for Agentic Coding

Cline-bench aims to create accurate benchmarks for evaluating AI models on real software development tasks. It focuses on capturing complex, real-world engineering challenges rather than simplified coding puzzles. Open source contributions will help shape these benchmarks and improve AI coding capabilities.

Saved by tldr-importer · Last saved February 14, 2026 · 6 min read

+ cline-bench + open-source + benchmarking ai-research ✓ coding ✓

Measuring Reward Hacking in LLMs

This article discusses "ImpossibleBench," a framework designed to assess how well language models (LLMs) follow task specifications without exploiting test cases. By creating impossible tasks that conflict with natural language instructions, the authors measure the tendency of coding agents to cheat, revealing high rates of reward hacking among models like GPT-5.

Saved by tldr-importer · Last saved February 14, 2026 · 8 min read

+ reward-hacking + language-models coding ✓ + benchmarks ai-research ✓