6 min read | Saved February 14, 2026
This article discusses the advancements in on-device language models, highlighting their advantages in latency, privacy, cost, and availability. It examines the constraints of mobile devices and explores effective strategies for building smaller, efficient models that can still perform complex tasks.
On-device large language models (LLMs) have advanced significantly in recent years. As of 2023, billion-parameter models can run in real time on high-end smartphones, a leap from the earlier toy demos. The advantages of on-device LLMs include lower latency (under 20 ms per token), stronger privacy because data never leaves the device, lower cost than cloud inference, and consistent availability regardless of internet connectivity. The main challenge, however, remains the limits of edge hardware, which is constrained in both memory capacity and memory bandwidth.
Memory is a key bottleneck: mobile devices typically leave less than 4 GB of RAM available to an application, which caps the size of model they can run. While mobile NPUs deliver impressive peak compute (e.g., ~35 TOPS on the Apple A19 Pro and ~60 TOPS on the Qualcomm Snapdragon 8 Elite Gen 5), their operator coverage for LLM workloads is often incomplete. Memory bandwidth is the other critical factor: mobile devices offer only 50-90 GB/s, compared with 2-3 TB/s on data-center GPUs. Because autoregressive token generation must read essentially all of the model's weights for every token produced, this gap directly limits decoding speed. Consequently, model compression techniques such as quantization, along with token-prediction strategies, are vital for acceptable performance on mobile.
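The bandwidth argument above can be made concrete with a simple roofline estimate: if each decoded token requires reading every weight once, throughput is bounded by bandwidth divided by model footprint. A minimal sketch (the function name and the example device figures are illustrative, not from the article):

```python
def decode_tokens_per_sec(params_billion: float,
                          bytes_per_param: float,
                          bandwidth_gbps: float) -> float:
    """Roofline upper bound on decoding speed: each generated token
    reads the full weight set once, so tokens/sec is capped by
    memory bandwidth divided by model size in bytes."""
    model_bytes = params_billion * 1e9 * bytes_per_param
    return bandwidth_gbps * 1e9 / model_bytes

# A 1B-parameter model on a phone with ~60 GB/s of memory bandwidth:
fp16 = decode_tokens_per_sec(1.0, 2.0, 60.0)  # 16-bit weights
int4 = decode_tokens_per_sec(1.0, 0.5, 60.0)  # 4-bit quantized weights

print(f"fp16: ~{fp16:.0f} tok/s, int4: ~{int4:.0f} tok/s")
# → fp16: ~30 tok/s, int4: ~120 tok/s
```

The estimate also shows why quantization matters so much here: shrinking bytes per parameter by 4x raises the decoding ceiling by the same factor, independent of compute.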
Recent findings indicate that small models, particularly those under 1 billion parameters, can still perform effectively if designed correctly. Research from MobileLLM shows that architecture matters more than sheer parameter count: deep-thin architectures (more layers at a narrower width) often outperform shallower, wider models at small scales. For instance, a 125M-parameter model can generate 50 tokens per second on an iPhone. Major players like Meta, Google, and Microsoft have adopted this approach, producing models that leverage high-quality training data and specialized methodologies to increase capability. Distillation from larger reasoning models has proven effective at improving smaller models' reasoning abilities, but challenges remain, especially for complex tasks that require extensive reasoning or broad knowledge.
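The deep-thin point can be illustrated by counting parameters for two model shapes with the same budget. The formula below is a standard rough approximation for a decoder-only transformer (~12·d_model² per layer for attention plus a 4x-wide FFN, plus tied embeddings), and the two configurations are illustrative, not MobileLLM's actual ones:

```python
def transformer_params(layers: int, d_model: int, vocab: int = 32000) -> int:
    """Rough parameter count for a decoder-only transformer:
    ~12 * d_model^2 per layer (QKV/output projections + 4x-wide FFN),
    plus a tied input/output embedding table."""
    return layers * 12 * d_model ** 2 + vocab * d_model

# Two shapes at a similar ~110M budget: deep-thin vs. shallow-wide.
deep_thin = transformer_params(layers=30, d_model=512)
shallow_wide = transformer_params(layers=12, d_model=768)

print(f"deep-thin (30 x 512):    {deep_thin / 1e6:.0f}M params")
print(f"shallow-wide (12 x 768): {shallow_wide / 1e6:.0f}M params")
# → deep-thin (30 x 512):    111M params
# → shallow-wide (12 x 768): 110M params
```

Both shapes land at nearly the same parameter count, which is exactly why architecture search at this scale is meaningful: the budget does not determine the shape, and MobileLLM's finding is that the deeper, thinner option tends to score better.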