
Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning

3 min read | Saved February 14, 2026

reasoning 🤖 vla 🤖 planning 🤖 manipulation 🤖 latency 🤖

Do you care about this?

Fast-ThinkAct is a framework designed to enhance reasoning in vision-language-action tasks by compressing lengthy textual reasoning into concise latent representations. It improves inference speed by up to 9.3 times while maintaining strong performance in tasks that require both visual understanding and action execution. The approach includes a teacher-student model where the student learns efficient reasoning from the teacher's guidance.

If you do, here's more

Fast-ThinkAct is a new framework for Vision-Language-Action (VLA) tasks. It improves reasoning efficiency by compressing lengthy textual reasoning into concise latent chains of thought (CoTs). This yields up to 9.3 times faster inference while still delivering strong reasoning performance on tasks that require interaction with dynamic visual environments.

The framework operates by training a teacher model that uses detailed textual reasoning combined with action-aligned visual rewards. A student model is then distilled from this teacher, learning to perform compact latent reasoning. This process is guided by a preference-driven objective that helps align the learned manipulation trajectories with both language and visual planning. As a result, Fast-ThinkAct enhances policy learning, connecting concise reasoning directly to action execution.
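To make "preference-driven objective" concrete, here is a minimal sketch of a generic pairwise (Bradley-Terry style) preference loss. This is not the loss used by Fast-ThinkAct, which this summary does not specify. The function name `preference_loss`, the scalar scores, and the `beta` temperature are all hypothetical and serve only to illustrate the general idea: the loss shrinks as a preferred trace is scored above a rejected one.

```python
import math


def preference_loss(score_preferred: float, score_rejected: float, beta: float = 1.0) -> float:
    """Generic pairwise preference loss: -log(sigmoid(beta * (s_pref - s_rej))).

    Illustrative only: the scores stand in for whatever the student model
    assigns to a preferred and a rejected reasoning trace (an assumption).
    """
    margin = beta * (score_preferred - score_rejected)
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on the sign of m
    # so that exp() never receives a large positive argument.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Loss is low when the preferred trace scores higher, high when the order is reversed.
print(preference_loss(2.0, 0.0) < preference_loss(0.0, 2.0))
```

With equal scores the loss is log 2. It falls toward zero as the margin in favor of the preferred trace grows.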

In experiments, Fast-ThinkAct reduced inference latency by up to 89.3% compared to existing state-of-the-art VLA models. It maintained effective long-horizon planning and adaptability in complex manipulation scenarios, as shown in evaluations on multiple benchmarks, including SimplerEnv and RoboTwin2.0. The qualitative results indicate that the verbalized reasoning from the student model is both more compact and more accurate than the verbose outputs of the teacher model, which suggests practical advantages in real-world applications.
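The two headline figures, a speedup of up to 9.3 times and a latency reduction of up to 89.3%, describe the same effect in different units. A short sketch of the conversion follows, assuming both figures refer to the same baseline; the function names are illustrative and not from the article.

```python
def speedup_from_latency_reduction(reduction: float) -> float:
    """Convert a fractional latency reduction r (0 <= r < 1) to a speedup factor 1 / (1 - r)."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


def latency_reduction_from_speedup(speedup: float) -> float:
    """Convert a speedup factor s (s >= 1) to a fractional latency reduction 1 - 1 / s."""
    if speedup < 1.0:
        raise ValueError("speedup must be >= 1")
    return 1.0 - 1.0 / speedup


# An 89.3% latency reduction corresponds to roughly a 9.3x speedup, and vice versa.
print(round(speedup_from_latency_reduction(0.893), 2))  # 9.35
print(round(latency_reduction_from_speedup(9.3), 4))    # 0.8925
```

The small gap between 9.35 and 9.3 is what rounding in the reported figures would produce.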
