4 min read
|
Saved February 14, 2026
Do you care about this?
This article discusses a new method for understanding user intent by breaking down interactions on mobile devices into two stages. By summarizing individual screens and then extracting intent from those summaries, small models can achieve results similar to larger models without needing server processing. The approach improves efficiency and maintains user privacy.
If you do, here's more
Google's recent research introduces a method for understanding user intents through a two-stage process using small multimodal language models (MLLMs). This approach aims to improve how devices interpret user actions during interactions. Instead of relying on large models that require server communication, which can be slow and raise privacy concerns, the new method processes data directly on devices. By breaking down the task into summarizing individual screen actions and then extracting intent from those summaries, the researchers demonstrate that small models can achieve results similar to larger ones.
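The two-stage decomposition can be illustrated with a minimal sketch. The function names and data fields below are hypothetical stand-ins: in the actual system, each stage would be a call to a small on-device multimodal model rather than the string logic stubbed in here.

```python
# Hypothetical sketch of the two-stage pipeline: stage 1 summarizes each
# screen interaction, stage 2 condenses the summaries into one intent.
from dataclasses import dataclass
from typing import List

@dataclass
class ScreenInteraction:
    app: str      # app or page in the foreground
    action: str   # what the user did on this screen
    context: str  # visible content relevant to the action

def summarize_screen(step: ScreenInteraction) -> str:
    """Stage 1: describe where the user was and what they did on a
    single screen (stands in for one small-MLLM call)."""
    return f"In {step.app}, the user {step.action} ({step.context})."

def extract_intent(summaries: List[str]) -> str:
    """Stage 2: condense the per-screen summaries into a single intent
    statement (stands in for a fine-tuned small-model call)."""
    return "User intent: " + " then ".join(summaries)

trajectory = [
    ScreenInteraction("Maps", "searched for 'coffee near me'", "results list shown"),
    ScreenInteraction("Maps", "tapped the top result", "cafe detail page"),
]
print(extract_intent([summarize_screen(s) for s in trajectory]))
```

The point of the decomposition is that each stage is a small, well-scoped task, so a compact model can handle it without ever seeing the full trajectory at once.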
In the first stage, the model analyzes each screen interaction, asking key questions about the context, the user's actions, and their potential goals. In the second stage, a fine-tuned model condenses these summaries into a single, clear intent statement. Techniques such as careful label preparation and dropping speculative information in the second stage further enhance performance. The researchers evaluated their model using the Bi-Fact approach, which assesses the accuracy of predicted intents against reference intents by breaking both down into "atomic facts" for precise comparison.
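The fact-level comparison can be sketched as a precision/recall computation over the two fact sets. This is a simplified assumption of how Bi-Fact-style scoring works: the real evaluation uses a model to extract and match atomic facts, whereas exact set overlap stands in for that matcher here.

```python
# Simplified Bi-Fact-style scoring: decompose predicted and reference
# intents into atomic facts, then compute an F1 over fact overlap.
def bifact_f1(predicted_facts: set, reference_facts: set) -> float:
    if not predicted_facts or not reference_facts:
        return 0.0
    matched = predicted_facts & reference_facts   # facts present in both
    precision = len(matched) / len(predicted_facts)
    recall = len(matched) / len(reference_facts)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative (made-up) fact sets for one predicted/reference pair.
pred = {"user booked a flight", "destination is Paris", "date is Friday"}
ref = {"user booked a flight", "destination is Paris"}
print(round(bifact_f1(pred, ref), 2))  # precision 2/3, recall 1.0 -> 0.8
```

Scoring at the level of atomic facts rewards a prediction that captures part of the intent correctly and penalizes unsupported details, which a whole-sentence match would miss.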
The results are promising. The decomposed method outperformed traditional approaches like chain of thought prompting and end-to-end fine-tuning across various tests on mobile and web trajectories. Notably, the Gemini 1.5 Flash 8B model using this approach produced results comparable to the more powerful Gemini 1.5 Pro model but at a significantly lower cost and faster speed. The research indicates that as mobile devices become more powerful, this method could lay the groundwork for enhancing assistive features that rely on understanding user intent.