2 links tagged with all of: llm + transformer + inference
Links
This article explains how Large Language Models (LLMs) process prompts from tokenization to response generation. It covers the transformer architecture, including self-attention and feed-forward networks, and details the importance of the KV cache in optimizing performance.
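As a rough sketch of the KV-cache idea the article covers: during autoregressive decoding, each attention layer stores the key and value projections of already-processed tokens, so each new step only projects the newest token and reuses the rest. The minimal single-head example below is NumPy-only; the class name `SelfAttentionWithCache` and the random weights are illustrative assumptions, not taken from the article.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class SelfAttentionWithCache:
    """Single-head self-attention that caches K/V across decode steps."""

    def __init__(self, d_model, rng=None):
        rng = rng or np.random.default_rng(0)
        self.wq = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
        self.wk = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
        self.wv = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
        self.k_cache = []  # one (d_model,) key vector per past token
        self.v_cache = []  # one (d_model,) value vector per past token

    def step(self, x):
        """Process one new token embedding x of shape (d_model,)."""
        q = x @ self.wq
        # Only the NEW token's K/V are computed; earlier ones are reused.
        self.k_cache.append(x @ self.wk)
        self.v_cache.append(x @ self.wv)
        K = np.stack(self.k_cache)          # (t, d_model)
        V = np.stack(self.v_cache)          # (t, d_model)
        scores = K @ q / np.sqrt(len(q))    # attend over all cached positions
        return softmax(scores) @ V          # (d_model,)

attn = SelfAttentionWithCache(d_model=8)
for tok_embedding in np.random.default_rng(1).normal(size=(5, 8)):
    out = attn.step(tok_embedding)
print(out.shape)  # (8,)
```

The payoff is that each decode step projects only one token instead of re-projecting the whole prefix, which is why the cache is central to inference performance.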
The article presents ChunkLLM, a lightweight, pluggable framework for speeding up inference in transformer-based large language models (LLMs) without sacrificing quality. It introduces two components, the QK Adapter and the Chunk Adapter, which handle feature compression and chunk-attention acquisition and deliver significant speedups during inference, especially on long texts. Experiments show that ChunkLLM retains a high level of performance while achieving speedups of up to 4.48x over standard transformer models.
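The summary does not spell out how the QK Adapter and Chunk Adapter work internally, so the following is only a hedged illustration of the general chunk-attention idea: compress each chunk of cached keys into a single representative, score chunks against the query, and run full attention only inside the top-scoring chunks. Every name here (`chunked_attention`, mean-pooled chunk keys, the `top_chunks` parameter) is an assumption for illustration, not ChunkLLM's actual mechanism.

```python
import numpy as np

def chunked_attention(q, K, V, chunk_size=4, top_chunks=2):
    """Attend only within the chunks whose pooled key best matches q.

    q: (d,) query; K, V: (t, d) cached keys/values.
    """
    t, d = K.shape
    n_chunks = (t + chunk_size - 1) // chunk_size
    # One compressed key per chunk (mean pooling -- an illustrative choice).
    chunk_keys = np.stack([K[i * chunk_size:(i + 1) * chunk_size].mean(axis=0)
                           for i in range(n_chunks)])
    # Score chunks against the query and keep only the most relevant ones.
    chunk_scores = chunk_keys @ q
    keep = np.argsort(chunk_scores)[-top_chunks:]
    idx = np.concatenate([np.arange(i * chunk_size, min((i + 1) * chunk_size, t))
                          for i in sorted(keep)])
    # Full softmax attention restricted to the selected chunks.
    scores = K[idx] @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V[idx]

rng = np.random.default_rng(0)
K = rng.normal(size=(16, 8))
V = rng.normal(size=(16, 8))
q = rng.normal(size=8)
print(chunked_attention(q, K, V).shape)  # (8,)
```

Restricting attention to a few selected chunks is what makes long-text speedups of the kind the article reports plausible: per-token cost scales with the number of chosen chunks rather than the full context length.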