7 min read | Saved February 14, 2026
Do you care about this?
Wes McKinney explores the arithmetic shortcomings of coding agents built on large language models (LLMs), such as Anthropic's Claude Code. He shares his experiences using these agents, noting that they can improve productivity but often struggle with basic calculations and reliability. Testing various models on arithmetic tasks, he finds that accuracy is inconsistent across both local and API-served models.
If you do, here's more
Wes McKinney reflects on his experience with large language models (LLMs) in software development, particularly focusing on their limitations in arithmetic. Initially skeptical about AI tools, he found value in using Anthropic’s Claude Code for automating routine tasks. Over eight months, he noted significant productivity gains, especially when creating low-complexity projects that he previously hesitated to pursue. However, he also faced frustrations with the model's cognitive deficits, including inconsistencies in following instructions and fabricating data.
McKinney highlights a specific experiment in which he tested LLMs' arithmetic skills using a CSV dataset, asking them to compute sums of values grouped by ID. The models struggled with basic arithmetic even on small datasets. Running the test across several models, including OpenAI's and Anthropic's, he found that accuracy varied significantly with the complexity of the data: models performed better on smaller datasets with fewer groups, and accuracy dropped as the datasets grew. This raises questions about the reliability of LLMs for tasks requiring precise calculation, which contrasts with the hype surrounding their potential to achieve artificial general intelligence (AGI).
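The shape of such a test is easy to reproduce. The sketch below is illustrative, not McKinney's actual harness: it generates a small CSV of (id, value) rows and computes the exact per-ID sums that an LLM's answer would be checked against; the column names and group labels are made up for the example.

```python
import csv
import io
import random
from collections import defaultdict

# Illustrative sketch (not the author's actual setup): build a small CSV
# of (id, value) rows like the one used to probe LLM arithmetic.
random.seed(0)
rows = [(random.choice(["a", "b", "c"]), random.randint(1, 100))
        for _ in range(30)]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["id", "value"])
writer.writerows(rows)
csv_text = buf.getvalue()  # this text would be pasted into the LLM prompt

# Ground truth: exact sums grouped by id, computed deterministically.
truth = defaultdict(int)
for id_, value in rows:
    truth[id_] += value

# An LLM's reply would be parsed into the same {id: sum} shape and
# compared against `truth`; any mismatch counts as an arithmetic error.
print(dict(truth))
```

Varying the number of rows and the number of distinct IDs reproduces the difficulty gradient described above: more groups and more rows give the model more intermediate sums to track, and errors become more likely.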