
New data on code quality: GPT-5.2 high, Opus 4.5, Gemini 3, and more

5 min read | Saved February 14, 2026

ai, code-quality, security, maintainability, analysis

Do you care about this?

This article uses SonarQube to analyze the quality, security, and maintainability of code generated by leading AI models such as GPT-5.2 High and Gemini 3 Pro. It presents findings on functional performance, complexity, concurrency issues, and security vulnerabilities across the models.

If you do, here's more

The SonarSource article analyzes the quality of code generated by several AI models: GPT-5.2 High, GPT-5.1 High, Gemini 3 Pro, Opus 4.5 Thinking, and Claude Sonnet 4.5. Using over 4,000 Java programming assignments, the evaluation measures functional correctness alongside structural quality, security, and maintainability. The findings are presented on the Sonar LLM Leaderboard, which plots models according to pass rate, cognitive complexity, and verbosity. Notably, Opus 4.5 Thinking leads with an 83.62% pass rate but generates over 639,000 lines of code, highlighting a trade-off between functional performance and the volume and complexity of the code produced.

A closer look reveals that even models with high pass rates, such as GPT-5.2 High at 80.66%, struggle with maintainability and bug density. For instance, GPT-5.2 has the highest code volume and the highest rate of concurrency issues, at 470 per million lines of code, almost double that of the next closest model. On security, GPT-5.2 shows the best results, with only 16 blocker vulnerabilities per million lines, whereas Claude Sonnet 4.5 has 198. Maintainability issues are prevalent across all models: code smells make up 92% to 96% of detected problems. GPT-5.1 generates the most generic code smells, at 4,426 per million lines, which points to a significant cleanup burden for developers using AI-generated code.
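The per-million-lines figures quoted here read as densities: a raw issue count divided by lines of code, scaled to one million lines. The sketch below assumes that normalization (the summary does not spell out the formula). The function names, the raw count of 47 issues, and the 50,000-line codebase are hypothetical; only the rates 16 and 198 come from the article.

```python
def issues_per_million_lines(issue_count: int, lines_of_code: int) -> float:
    """Scale a raw issue count to a rate per one million lines of code."""
    if lines_of_code <= 0:
        raise ValueError("lines_of_code must be positive")
    return issue_count * 1_000_000 / lines_of_code


def expected_issues(rate_per_million: float, lines_of_code: int) -> float:
    """Project a per-million-lines rate onto a codebase of a given size."""
    return rate_per_million * lines_of_code / 1_000_000


# Hypothetical raw inputs, chosen only to illustrate the arithmetic:
# 47 issues found in 100,000 generated lines gives a rate of 470.
hypothetical_rate = issues_per_million_lines(47, 100_000)

# Rates from the article (blocker vulnerabilities per million lines),
# projected onto a hypothetical 50,000-line codebase.
codebase_lines = 50_000
gpt_52_blockers = expected_issues(16, codebase_lines)
sonnet_45_blockers = expected_issues(198, codebase_lines)

print(f"hypothetical rate: {hypothetical_rate:.0f} per million lines")
print(f"GPT-5.2: about {gpt_52_blockers:.1f} blocker vulnerabilities")
print(f"Claude Sonnet 4.5: about {sonnet_45_blockers:.1f} blocker vulnerabilities")
```

Normalizing by lines of code makes models with very different output volumes comparable, but it also means a verbose model can have a low density and still produce a large absolute number of issues.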

Models also vary in how reliably they handle software engineering fundamentals. For example, Gemini 3 Pro has the highest rate of control-flow mistakes, at 200 per million lines, while GPT-5.2 has the lowest, at just 22. The analysis emphasizes that functional correctness is only part of the picture; understanding the structural integrity and maintainability of generated code is essential before deploying it in production environments.
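To put the control-flow gap in relative terms, the two quoted rates can be compared as a simple ratio. This is a minimal sketch using only the two figures given in the summary; the dictionary and function names are illustrative.

```python
# Control-flow mistakes per million lines of code, as quoted in the summary.
control_flow_rates = {
    "Gemini 3 Pro": 200,
    "GPT-5.2": 22,
}


def rate_ratio(rates: dict[str, float], higher: str, lower: str) -> float:
    """Return how many times larger one model's rate is than another's."""
    return rates[higher] / rates[lower]


ratio = rate_ratio(control_flow_rates, "Gemini 3 Pro", "GPT-5.2")
print(f"Gemini 3 Pro vs GPT-5.2: {ratio:.1f}x")  # prints "Gemini 3 Pro vs GPT-5.2: 9.1x"
```

By this measure, Gemini 3 Pro's control-flow mistake rate is roughly nine times that of GPT-5.2.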
