2 links tagged with all of: llm + transformer + inference
Links
This article explains how Large Language Models (LLMs) process prompts from tokenization to response generation. It covers the transformer architecture, including self-attention and feed-forward networks, and details the importance of the KV cache in optimizing performance.
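As a rough sketch of the KV-cache idea the article covers: during autoregressive decoding, each attention layer stores the key and value projections of already-processed tokens, so each new step only projects the newest token and reuses the rest. The minimal single-head example below is NumPy-only; the class name `SelfAttentionWithCache` and the random weights are illustrative assumptions, not taken from the article.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class SelfAttentionWithCache:
    """Single-head self-attention that caches K/V across decode steps."""

    def __init__(self, d_model, rng=None):
        rng = rng or np.random.default_rng(0)
        self.wq = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
        self.wk = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
        self.wv = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
        self.k_cache = []  # one (d_model,) key vector per past token
        self.v_cache = []  # one (d_model,) value vector per past token

    def step(self, x):
        """Process one new token embedding x of shape (d_model,)."""
        q = x @ self.wq
        # Only the NEW token's K/V are computed; earlier ones are reused.
        self.k_cache.append(x @ self.wk)
        self.v_cache.append(x @ self.wv)
        K = np.stack(self.k_cache)          # (t, d_model)
        V = np.stack(self.v_cache)          # (t, d_model)
        scores = K @ q / np.sqrt(len(q))    # attend over all cached positions
        return softmax(scores) @ V          # (d_model,)

attn = SelfAttentionWithCache(d_model=8)
for tok_embedding in np.random.default_rng(1).normal(size=(5, 8)):
    out = attn.step(tok_embedding)
print(out.shape)  # (8,)
```

The payoff is that each decode step projects only one token instead of re-projecting the whole prefix, which is why the cache is central to inference performance.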
The article presents ChunkLLM, a lightweight, pluggable framework for speeding up inference in transformer-based large language models (LLMs) without sacrificing quality. It introduces two components, the QK Adapter and the Chunk Adapter, which handle feature compression and chunk-attention acquisition and deliver significant speedups during inference, especially on long texts. Experiments show that ChunkLLM retains a high level of performance while achieving speedups of up to 4.48x over standard transformer models.
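The summary does not spell out how the QK Adapter and Chunk Adapter work internally, so the following is only a hedged illustration of the general chunk-attention idea: compress each chunk of cached keys into a single representative, score chunks against the query, and run full attention only inside the top-scoring chunks. Every name here (`chunked_attention`, mean-pooled chunk keys, the `top_chunks` parameter) is an assumption for illustration, not ChunkLLM's actual mechanism.

```python
import numpy as np

def chunked_attention(q, K, V, chunk_size=4, top_chunks=2):
    """Attend only within the chunks whose pooled key best matches q.

    q: (d,) query; K, V: (t, d) cached keys/values.
    """
    t, d = K.shape
    n_chunks = (t + chunk_size - 1) // chunk_size
    # One compressed key per chunk (mean pooling -- an illustrative choice).
    chunk_keys = np.stack([K[i * chunk_size:(i + 1) * chunk_size].mean(axis=0)
                           for i in range(n_chunks)])
    # Score chunks against the query and keep only the most relevant ones.
    chunk_scores = chunk_keys @ q
    keep = np.argsort(chunk_scores)[-top_chunks:]
    idx = np.concatenate([np.arange(i * chunk_size, min((i + 1) * chunk_size, t))
                          for i in sorted(keep)])
    # Full softmax attention restricted to the selected chunks.
    scores = K[idx] @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V[idx]

rng = np.random.default_rng(0)
K = rng.normal(size=(16, 8))
V = rng.normal(size=(16, 8))
q = rng.normal(size=8)
print(chunked_attention(q, K, V).shape)  # (8,)
```

Restricting attention to a few selected chunks is what makes long-text speedups of the kind the article reports plausible: per-token cost scales with the number of chosen chunks rather than the full context length.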