6 min read | Saved February 14, 2026
Do you care about this?
The article argues that current methods for evaluating AI models miss the qualities that matter in practice, particularly design ability and how much oversight a model still needs. It points to Gemini 3 and Opus 4.5 as real advances in design and coding work that existing benchmarks fail to register, and calls for more qualitative assessments that better reflect what LLMs can actually do.
If you do, here's more
The author believes we may be living through a capability shift comparable to the jump from GPT-3.5 to GPT-4, only subtler. They argue that the traditional way of evaluating language models, chatting with them, has stopped being informative: users have grown accustomed to a certain baseline of performance, so new releases often disappoint. In day-to-day use the author now values response speed over marginal quality gains, and admits that this preference has skewed their own evaluation criteria.
Gemini 3 Pro Preview stands out for design tasks, where it excels at generating attractive prototypes for websites and landing pages. The author highlights its ability to produce visually appealing output that closely matches existing branding, a significant advantage in product development; other models tend to return generic, emoji-laden results. As a practical workflow, the author suggests feeding Gemini 3 the product's existing CSS so that generated prototypes stay on-brand, which could significantly speed up prototyping (a rough sketch of this follows below).
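As a rough illustration of that workflow, the sketch below passes an existing stylesheet to a Gemini model and asks for a single-file landing page that reuses its design tokens. It is a minimal sketch, assuming the google-generativeai Python SDK; the model id `gemini-3-pro-preview`, the file name `brand.css`, and the prompt wording are placeholders, not the author's actual setup.

```python
import os
from pathlib import Path

import google.generativeai as genai  # assumes the google-generativeai SDK is installed

# Configure the client; expects an API key in the environment.
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

# Load the product's existing stylesheet so the prototype inherits real branding.
css = Path("brand.css").read_text()  # hypothetical path to the site's CSS

# Model id is a placeholder; substitute whichever Gemini 3 variant is available.
model = genai.GenerativeModel("gemini-3-pro-preview")

prompt = (
    "Using only the colors, fonts, and spacing defined in the stylesheet below, "
    "generate a single-file HTML landing page prototype for this product. "
    "Return complete HTML with the styles inlined.\n\n"
    f"{css}"
)

response = model.generate_content(prompt)

# Write the generated prototype to disk for a quick look in a browser.
Path("prototype.html").write_text(response.text)
```

The point is less the specific API than the move it illustrates: anchoring the prompt in the real design system instead of letting the model improvise one.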
On the software engineering front, Opus 4.5 is noted for its improved performance compared to previous models like Sonnet 4.5. Users experience fewer errors and less need for constant oversight, making it seem more competent in managing complex tasks. The author points out that current benchmarks for evaluating models often fail to capture these qualitative aspects, such as design taste and iterative interaction. They argue that the industry should adopt new benchmarks that reflect real-world usage, focusing on the nuanced capabilities of LLMs rather than strictly quantitative assessments.