6 min read | Saved February 14, 2026
Do you care about this?
Sakana AI's Sudoku-Bench tests AI reasoning with handcrafted sudoku puzzles. GPT-5 has achieved a 33% solve rate, outperforming previous models but still struggling with complex puzzles. The article explores the limitations of current AI reasoning methods and emphasizes the need for further research.
If you do, here's more
In May 2025, Sakana AI introduced the Sudoku-Bench, a set of handcrafted sudoku puzzles designed to test the reasoning capabilities of large language models (LLMs). At the time of the release, models like ChatGPT-o3 struggled to solve any classic 9x9 sudoku puzzles. Fast forward to today, GPT-5 has emerged as a leader on the Sudoku-Bench, achieving a 33% solve rate on challenge_100. This marks a significant improvement, as itβs the first model capable of solving a 9x9 modern sudoku problem. Despite this progress, the Sudoku-Bench remains a formidable challenge, with only a third of the puzzles solved.
The article highlights that while GPT-5 shows strong performance, the task of solving these puzzles exposes limitations in AI reasoning. Current models often manage local consistency but falter on global reasoning chains, especially when faced with creative problem-solving techniques used by human experts. The benchmark includes both classic and modern sudoku variants, which require models to adapt to new rules and constraints. The modern puzzles are particularly tough, as they demand sophisticated meta-reasoning beyond traditional problem-solving strategies.
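The gap between local consistency and global reasoning can be made concrete with a minimal sketch (this is illustrative Python, not Sakana AI's evaluation code): checking that a single digit placement respects its row, column, and box is a local operation, while actually solving the puzzle requires a global search in which every local choice must survive all subsequent placements.

```python
def is_locally_consistent(grid, row, col, digit):
    """Local check: does `digit` clash with the row, column, or 3x3 box
    of one cell? This is the kind of constraint models handle reliably."""
    if any(grid[row][c] == digit for c in range(9)):
        return False
    if any(grid[r][col] == digit for r in range(9)):
        return False
    br, bc = 3 * (row // 3), 3 * (col // 3)
    return not any(grid[r][c] == digit
                   for r in range(br, br + 3)
                   for c in range(bc, bc + 3))

def solve(grid):
    """Global reasoning via backtracking: each placement must remain
    compatible with every future placement, and a locally valid choice
    may have to be undone much later in the search."""
    for r in range(9):
        for c in range(9):
            if grid[r][c] == 0:
                for d in range(1, 10):
                    if is_locally_consistent(grid, r, c, d):
                        grid[r][c] = d
                        if solve(grid):
                            return True
                        grid[r][c] = 0  # locally fine, globally wrong: backtrack
                return False  # no digit fits this cell
    return True  # no empty cells left
```

The `solve` function only terminates correctly because failed local choices propagate back up the recursion; an LLM reasoning in free text has no such built-in undo mechanism, which is one way to read the article's point about long global reasoning chains.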
Sakana AI also explored how open-source models fine-tuned with GRPO (Group Relative Policy Optimization), a reinforcement-learning method known for its efficiency on math tasks, performed on Sudoku-Bench. The results were disappointing, indicating that reasoning skills effective in math don't necessarily translate to the spatial and logical complexities of sudoku. The article highlights a more promising approach: training LLMs directly on human reasoning processes, using video transcripts from expert solvers. The challenge lies in the sheer length of these transcripts, so the team distilled the data into key insights intended to guide models' reasoning and enhance their problem-solving abilities.
Ultimately, while GPT-5 has made strides, the gap between human and AI reasoning remains significant. The article underscores that even advanced training methods face limitations in applying human-like reasoning to sudoku puzzles. GPT-5 excels in algebraic reasoning but struggles with spatial challenges, demonstrating that more work is needed to bridge the divide between human thought processes and AI capabilities.