Saved February 14, 2026
The article explains how benchmarking large language models (LLMs) can significantly reduce costs for businesses that rely on LLM API services. By testing their actual prompts against a range of models, users can often find cheaper options with comparable performance, potentially saving thousands of dollars.
Many businesses overspend on LLM APIs because they default to popular models like GPT-5 without assessing their actual needs. One founder facing a $1,500 monthly bill for API calls learned that benchmarking his workload against more than 100 models could cut that cost substantially. Publicly available benchmarks don't reliably predict performance on specific tasks, so the author helped the founder build custom benchmarks tailored to his use case, customer support.
The process involved collecting real customer support chat data, defining expected outputs, and running tests across various LLMs. They used OpenRouter to easily switch between models and gathered data on each model's performance. To evaluate the responses, they employed another LLM to score how well the answers matched the expected outputs. This approach enabled them to analyze quality, cost, and latency effectively.
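The loop described above can be sketched against OpenRouter's OpenAI-compatible chat completions endpoint. This is a minimal sketch, not the author's actual code: the helper names, the judge prompt, and the 1-10 scoring scheme are assumptions, and it expects an `OPENROUTER_API_KEY` environment variable.

```python
import json
import os
import re
import urllib.request

# OpenRouter exposes every model behind one OpenAI-compatible endpoint,
# so switching models is just a string change in the payload.
OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_payload(model: str, prompt: str) -> dict:
    """OpenAI-style chat payload; `model` selects any OpenRouter model."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def ask(model: str, prompt: str) -> str:
    """Send one prompt to one model and return its text reply."""
    req = urllib.request.Request(
        OPENROUTER_URL,
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Hypothetical judge prompt: a second LLM compares the candidate answer
# to the expected support reply and emits a single score.
JUDGE_PROMPT = (
    "Score from 1 to 10 how well the ANSWER matches the EXPECTED reply. "
    "Respond with only the number.\n"
    "EXPECTED: {expected}\nANSWER: {answer}"
)

def parse_score(raw: str) -> int:
    """Judge models sometimes add stray text; keep the first integer found."""
    m = re.search(r"\d+", raw)
    return int(m.group()) if m else 0

def judge(judge_model: str, expected: str, answer: str) -> int:
    """Score one candidate answer with the judge model."""
    raw = ask(judge_model, JUDGE_PROMPT.format(expected=expected, answer=answer))
    return parse_score(raw)
```

Wrapping each `ask` call with a timer would also capture the latency numbers mentioned above, and most responses include a `usage` field with token counts for estimating cost.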
Ultimately, they discovered that some models delivered comparable quality at significantly lower cost, in some cases 10 times less. The founder opted for a model that cut his costs by a factor of 5, saving over $1,000 per month. The complexity of the benchmarking process led the author to build a tool, evalry, which automates benchmarking across more than 300 LLMs, making it easier for others to find the optimal model for their specific requirements.
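The trade-off behind that choice can be sketched as a filter-then-minimize step: keep models within a small quality margin of the best, then take the cheapest. The function, tolerance, and numbers below are hypothetical illustrations, not the article's actual data.

```python
from typing import Dict, List

def pick_model(results: List[Dict], quality_slack: float = 0.05) -> Dict:
    """Keep models within `quality_slack` of the best score, take the cheapest."""
    best = max(r["quality"] for r in results)
    good_enough = [r for r in results if r["quality"] >= best * (1 - quality_slack)]
    return min(good_enough, key=lambda r: r["usd_per_1k_calls"])

# Illustrative numbers only, not real benchmark results.
results = [
    {"model": "frontier-model", "quality": 9.2, "usd_per_1k_calls": 45.0},
    {"model": "mid-tier-model", "quality": 9.0, "usd_per_1k_calls": 9.0},
    {"model": "budget-model",   "quality": 7.1, "usd_per_1k_calls": 2.0},
]
```

With these numbers, `pick_model(results)` skips the frontier model because the mid-tier one sits within 5% of its quality at a fifth of the price, mirroring the 5x saving described above.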