Quit Emailing Yourself

5 links tagged with all of: efficiency + language-models

Click any tag below to further narrow down your results

Links

Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation

This article introduces Mixture-of-Recursions (MoR), a framework that enhances the efficiency of language models by combining parameter sharing and adaptive computation. MoR dynamically adjusts recursion depths for individual tokens, improving memory access and reducing computational costs while maintaining model performance. It shows significant improvements in validation perplexity and few-shot accuracy across various model sizes.

Saved by tldr-importer · Last saved February 14, 2026 · 2 min read

language-models ✓ + recursion efficiency ✓ + computation + transformers

On-Device LLMs: State of the Union, 2026

This article discusses the advancements in on-device language models, highlighting their advantages in latency, privacy, cost, and availability. It examines the constraints of mobile devices and explores effective strategies for building smaller, efficient models that can still perform complex tasks.

Saved by tldr-importer · Last saved February 14, 2026 · 6 min read

+ on-device language-models ✓ + mobile efficiency ✓ + privacy

Efficiently Serving Large Language Models: A Gentle Introduction to VLLM Framework

The article serves as an introduction to VLLM, a framework designed for serving large language models efficiently. It discusses the benefits of using VLLM, including reduced latency and improved resource management, making it suitable for production environments. Key features and implementation steps are also highlighted to assist users in adopting this technology.

Saved by tldr-importer · Last saved October 29, 2025 · 1 min read

+ vllm + machine-learning + serving language-models ✓ efficiency ✓

[no-title]

The article discusses the phenomenon that shorter tokens in language models tend to have a higher likelihood of being selected in various contexts. It explores the implications of this tendency for understanding how language processing works in computational models. Additionally, the author examines how the length of tokens can affect the efficiency and accuracy of these models.

Saved by tldr-importer · Last saved October 29, 2025 · 1 min read

language-models ✓ + tokenization + computational-linguistics efficiency ✓ + accuracy

Chain of Draft: Thinking Faster by Writing Less

The paper introduces the Chain of Draft (CoD) paradigm, which enables Large Language Models (LLMs) to generate concise intermediate reasoning outputs, mimicking human draft strategies. By focusing on essential information and reducing verbosity, CoD achieves comparable or superior accuracy to Chain-of-Thought prompting while utilizing significantly fewer tokens, thus lowering costs and latency in reasoning tasks.

Saved by tldr-importer · Last saved October 29, 2025 · 1 min read

+ chain-of-thought language-models ✓ + reasoning efficiency ✓ + minimalism