Saved February 14, 2026
Do you care about this?
This article introduces Mixture-of-Recursions (MoR), a framework that enhances the efficiency of language models by combining parameter sharing and adaptive computation. MoR dynamically adjusts recursion depths for individual tokens, improving memory access and reducing computational costs while maintaining model performance. It shows significant improvements in validation perplexity and few-shot accuracy across various model sizes.
If you do, here's more
The paper introduces Mixture-of-Recursions (MoR), a framework designed to enhance the efficiency of language models by combining parameter sharing and adaptive computation. As language models grow in size and complexity, their computational and memory requirements become a significant barrier to training and deployment. MoR addresses this challenge by reusing a shared stack of layers across multiple recursion steps, which reduces the number of parameters needed. Meanwhile, it employs lightweight routers that assign varying recursion depths to different tokens, allowing the model to concentrate computational resources on the most relevant tokens at any given time.
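The per-token routing idea can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the linear router (`w_router`), the depth-bucketing rule, and the tiny `tanh` block are all placeholder assumptions standing in for MoR's actual lightweight routers and shared Transformer layer stack.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, seq_len, max_depth = 8, 5, 3

# One shared weight matrix reused at every recursion step
# (parameter sharing: no separate weights per depth).
W_shared = rng.normal(scale=0.1, size=(d_model, d_model))

# Hypothetical lightweight router: a linear scorer whose output
# is bucketed into an integer recursion depth in [1, max_depth].
w_router = rng.normal(size=d_model)

def route_depths(x):
    scores = x @ w_router                       # one scalar score per token
    probs = 1.0 / (1.0 + np.exp(-scores))       # squash to (0, 1)
    return np.minimum(max_depth, 1 + (probs * max_depth).astype(int))

def mor_forward(x):
    depths = route_depths(x)
    h = x.copy()
    for step in range(1, max_depth + 1):
        active = depths >= step                 # tokens still recursing
        h[active] = np.tanh(h[active] @ W_shared)  # shared block, reused
    return h, depths

x = rng.normal(size=(seq_len, d_model))
h, depths = mor_forward(x)
print(depths.tolist())   # per-token recursion depths
print(h.shape)           # (5, 8): hidden states after adaptive recursion
```

The key property: every token passes through the same shared block, but "easy" tokens exit after one step while "hard" tokens recurse up to `max_depth` times, so compute concentrates where the router deems it useful.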
MoR optimizes memory access by caching key-value pairs selectively, which improves efficiency further. The framework also includes a variant that reuses the key-value pairs computed at the first recursion in all later steps, streamlining memory usage even more. Experiments on models ranging from 135 million to 1.7 billion parameters show that MoR achieves lower validation perplexity and better few-shot accuracy than baseline models trained with the same FLOP budget, despite having fewer unique parameters. These results indicate that MoR effectively balances the quality of large models with reduced resource demands, offering a promising avenue for future developments in language model efficiency.
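The two caching strategies described above can be sketched as follows. This is an illustrative sketch only: the fixed `depths` array stands in for router outputs, and the dict-based cache is a simplification of a real attention KV cache.

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d_head, max_depth = 6, 4, 3

# Hypothetical per-token recursion depths produced by a router.
depths = np.array([1, 3, 2, 3, 1, 2])

K = rng.normal(size=(seq_len, d_head))
V = rng.normal(size=(seq_len, d_head))

def build_selective_cache(K, V, depths, max_depth):
    """Cache KV pairs only for tokens still active at each recursion step."""
    cache = {}
    for step in range(1, max_depth + 1):
        idx = np.where(depths >= step)[0]   # tokens recursing at this step
        cache[step] = {"idx": idx, "K": K[idx], "V": V[idx]}
    return cache

def build_shared_cache(K, V, max_depth):
    """Variant: reuse the first recursion's full KV at every later step."""
    first = {"idx": np.arange(len(K)), "K": K, "V": V}
    return {step: first for step in range(1, max_depth + 1)}

sel = build_selective_cache(K, V, depths, max_depth)
# Deeper steps cache fewer tokens, so memory traffic shrinks with depth:
print([len(sel[s]["idx"]) for s in range(1, max_depth + 1)])  # [6, 4, 2]
```

Selective caching trades a little bookkeeping (tracking which token indices are active per step) for strictly less KV memory at deeper steps; the sharing variant removes even that bookkeeping by computing KV once and reusing it.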