Click any tag below to further narrow down your results
Links
Cline-bench aims to create accurate benchmarks for evaluating AI models on real software development tasks. It focuses on capturing complex, real-world engineering challenges rather than simplified coding puzzles. Open source contributions will help shape these benchmarks and improve AI coding capabilities.
This article discusses "ImpossibleBench," a framework designed to assess how well language models (LLMs) follow task specifications without exploiting test cases. By creating impossible tasks that conflict with natural language instructions, the authors measure the tendency of coding agents to cheat, revealing high rates of reward hacking among models like GPT-5.