7 min read | Saved February 14, 2026
Do you care about this?
OpenAI has launched GPT-5.1-Codex-Max, an upgraded coding model that outperforms its predecessor on software-engineering benchmarks but remains limited in cybersecurity capability. The article questions the rigor of the model's evaluations, compares it to previous versions, and asks how useful it is in practice.
If you do, here's more
OpenAI has released GPT-5.1-Codex-Max, its most advanced coding model to date. It is faster, more capable, and more efficient than its predecessor, GPT-5.1-Codex, posting 77.9% on SWE-bench Verified, 79.9% on SWE-Lancer IC SWE, and 58.1% on Terminal-Bench 2.0. A new 27-page system card details the model's features, including its ability to operate coherently across multiple context windows, which lets it sustain longer, more complex tasks.
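The system card does not spell out how multi-context-window operation is implemented, but the usual pattern is compaction: when the transcript nears the context limit, older turns are collapsed into a summary so work can continue. Below is a minimal sketch of that pattern; `count_tokens`, `summarize`, and the limits are illustrative stand-ins, not OpenAI's actual mechanism.

```python
# Hypothetical compaction loop: collapse old turns into a summary
# whenever the running transcript would exceed the context limit.
CONTEXT_LIMIT = 100   # tokens; tiny on purpose, for demonstration
KEEP_RECENT = 2       # most recent turns kept verbatim

def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer.
    return len(text.split())

def summarize(turns: list[str]) -> str:
    # Stand-in for a model-generated summary of earlier work.
    return f"[summary of {len(turns)} earlier turns]"

def compact(history: list[str]) -> list[str]:
    """Collapse all but the most recent turns into one summary turn."""
    if len(history) <= KEEP_RECENT:
        return history
    old, recent = history[:-KEEP_RECENT], history[-KEEP_RECENT:]
    return [summarize(old)] + recent

def append_turn(history: list[str], turn: str) -> list[str]:
    """Add a turn, compacting as needed to stay under the limit."""
    history = history + [turn]
    while (sum(count_tokens(t) for t in history) > CONTEXT_LIMIT
           and len(history) > KEEP_RECENT):
        history = compact(history)
    return history
```

The point of the pattern is that total task length is no longer bounded by a single window; only the quality of the summaries limits coherence.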
In cybersecurity, GPT-5.1-Codex-Max shows promise but has not yet reached high-level capability. Its success rate on Capture the Flag tasks jumped from 50% to 76%, yet its broader cyber-offensive performance remains modest compared to GPT-5: 37% on Network Attack Simulations and 41% on Vulnerability Discovery challenges. The model has not passed the tougher evaluations, raising questions about its readiness for high-stakes situations.
The system card highlights mitigations against harmful tasks, claiming a 100% refusal rate for malware requests, a figure that looks overly optimistic without more rigorous benchmarks. The model runs sandboxed by default on the supported operating systems, but users can relax those settings, which reintroduces risk. OpenAI's Safety Advisory Committee has recommended harder evaluations, as the current metrics do not convincingly demonstrate high capability. Overall, despite real improvements, the model's readiness and effectiveness in cybersecurity remain uncertain.
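To make the sandboxing trade-off concrete, here is a minimal default-deny command runner in the spirit of what the system card describes. Everything here is hypothetical: the blocklist, the `allow_network` flag, and the helper names are illustrative and are not Codex CLI options or OpenAI's actual sandbox, which would use OS-level isolation rather than a name check.

```python
import shlex
import subprocess

# Minimal environment: no API keys or tokens leak into child processes.
SAFE_ENV = {"PATH": "/usr/bin:/bin"}

def run_sandboxed(cmd: str, allow_network: bool = False) -> str:
    """Run a command with network access denied by default.

    This is a crude stand-in for real isolation (namespaces, seccomp,
    firewall rules): it simply refuses obvious network tools unless the
    caller opts in.
    """
    argv = shlex.split(cmd)
    if not allow_network:
        blocked = {"curl", "wget", "ssh", "nc"}
        if argv and argv[0] in blocked:
            raise PermissionError(
                f"{argv[0]} blocked: network disabled by default")
    result = subprocess.run(argv, env=SAFE_ENV, capture_output=True,
                            text=True, timeout=10, check=True)
    return result.stdout
```

Passing `allow_network=True` is exactly the kind of user-adjustable loosening the article flags as a risk: the default is safe, but one flag away from not being so.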