Links
This article outlines Distribution-Aligned Sequence Distillation, a new pipeline for improving model performance on reasoning tasks such as math and code generation using minimal training data. It introduces models such as DASD-4B-Thinking and DASD-30B-A3B-Thinking-Preview, which outperform larger models on various benchmarks. The methodology includes temperature-scheduled learning and mixed-policy distillation.
The article discusses the importance of the "harness" in AI coding tools, arguing that it influences performance more than the underlying models themselves. It highlights issues with existing patching methods and proposes a new approach using content hashes to improve edit accuracy. The author emphasizes that innovation in harness design is crucial for advancing AI coding capabilities.
Sakana AI's Sudoku-Bench tests AI reasoning with handcrafted sudoku puzzles. GPT-5 has achieved a 33% solve rate, outperforming previous models but still struggling with complex puzzles. The article explores the limitations of current AI reasoning methods and emphasizes the need for further research.
Poetiq announced it has set new performance standards on the ARC-AGI benchmarks by integrating the latest AI models, Gemini 3 and GPT-5.1. Their systems improve accuracy while reducing costs, demonstrating significant advancements in AI reasoning capabilities.
The article explores the limitations of current evaluation methods for AI models, particularly in assessing design capabilities and how well a model works without constant oversight. It highlights the advancements of Gemini 3 and Opus 4.5 in design and coding tasks, suggesting that existing benchmarks fail to capture these qualities. The author argues for a shift toward more qualitative assessments to better reflect the capabilities of LLMs.
The ARC Prize Foundation evaluates OpenAI's latest models, o3 and o4-mini, on its ARC-AGI benchmarks, revealing varying performance levels in reasoning tasks. While o3 shows significant improvements in accuracy on ARC-AGI-1, both models struggle with the harder ARC-AGI-2, indicating that AI reasoning remains an open problem. The article emphasizes the importance of model efficiency and the role of public benchmarks in understanding AI advancements.