7 min read | Saved February 14, 2026
Do you care about this?
The article discusses the importance of the "harness" in AI coding tools, arguing that it influences performance more than the underlying models themselves. It highlights issues with existing patching methods and proposes a new approach using content hashes to improve edit accuracy. The author emphasizes that innovation in harness design is crucial for advancing AI coding capabilities.
If you do, here's more
The article critiques the current focus on which AI coding model is superior, such as GPT-5.3 versus Opus, arguing that the real issue lies in the harness used to interact with these models. The author, who has contributed extensively to an open-source coding agent called "oh-my-pi," emphasizes that the harness is crucial for managing user input, output tokens, and error handling. Many models fail in practical applications not due to their coding capabilities but because their harnesses don’t efficiently manage edits and inputs.
The author surveys the editing methods different harnesses employ. Codex, for instance, uses an "apply_patch" format, which produces high failure rates when paired with models not trained on its structure. Simpler methods like "str_replace" require the model to reproduce the existing text exactly, so small mismatches (whitespace, quoting) cause edits to fail outright. A more advanced approach delegates merging to a separate neural network, but even that isn't foolproof. The piece proposes a new concept: tagging each line of code with a content hash to improve accuracy and reliability during edits. This lets models reference specific lines without needing to reproduce prior content, reducing errors.
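The article does not spell out the exact tag format, but the hashline idea can be sketched minimally: show the model each line prefixed with a short content hash, and let edits target a hash instead of reproducing the line's text. The helper names and the six-character hash length below are illustrative assumptions, not the author's implementation.

```python
import hashlib


def line_hash(line: str) -> str:
    # Illustrative choice: first 6 hex chars of SHA-256 of the line.
    return hashlib.sha256(line.encode()).hexdigest()[:6]


def tag_lines(text: str) -> list[str]:
    """Render a file view where each line carries its content hash,
    so an edit can name a line without reproducing its text."""
    return [f"{line_hash(line)}|{line}" for line in text.splitlines()]


def apply_hash_edit(text: str, target_hash: str, replacement: str) -> str:
    """Replace the line whose content hash matches target_hash.
    A stale hash means the file changed since the model saw it,
    so the edit is rejected instead of landing on the wrong line."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line_hash(line) == target_hash:
            lines[i] = replacement
            return "\n".join(lines)
    raise ValueError(f"no line with hash {target_hash}; file has changed")


src = "def add(a, b):\n    return a - b"
view = tag_lines(src)                 # what the model sees
bug_hash = view[1].split("|", 1)[0]   # hash of the buggy line
fixed = apply_hash_edit(src, bug_hash, "    return a + b")
```

Unlike str_replace, the model never has to reproduce the old line verbatim, and unlike a line-number scheme, a hash that no longer matches fails loudly rather than silently editing a shifted line.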
The benchmarking results are striking. Sixteen models were tested across three editing tools, revealing that the patch format performed poorly overall. Alternatives like hashline or replace methods yielded better results, especially for weaker models, which showed dramatic improvements in performance. For example, Grok Code Fast 1’s success rate jumped from 6.7% to 68.3% after switching formats. The author also mentions their own negative experiences with major players like Anthropic and Google, highlighting a restrictive environment for developing and testing harnesses.