The article presents ChunkLLM, a lightweight, pluggable framework designed to speed up inference in transformer-based large language models (LLMs) while preserving performance. It introduces two novel components, the QK Adapter and the Chunk Adapter, which handle feature compression and chunk-attention acquisition, respectively. Experimental results show that ChunkLLM retains a high level of performance while achieving inference speedups of up to 4.48x over standard transformer models, with the largest gains on long texts.
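To make the chunk-attention idea concrete, here is a minimal sketch of one plausible mechanism: per-chunk key summaries score each chunk against the query, and full attention is then computed only over tokens in the top-scoring chunks. This is an illustrative assumption, not the paper's actual QK Adapter or Chunk Adapter; the function name `chunk_topk_attention`, the mean-pooling compression, and the top-k selection rule are all hypothetical stand-ins.

```python
import numpy as np

def chunk_topk_attention(q, k, v, chunk_size=4, top_k=2):
    """Hypothetical sketch of chunk-based sparse attention:
    compress keys per chunk (mean pooling here, as a stand-in for a
    learned adapter), score chunks against the query, then attend
    only over tokens in the top-k scoring chunks."""
    seq_len, d = k.shape
    n_chunks = seq_len // chunk_size
    # Chunk-level key summaries via mean pooling (an assumption)
    chunk_keys = k[:n_chunks * chunk_size].reshape(n_chunks, chunk_size, d).mean(axis=1)
    # Score each chunk summary against the query and keep the top-k chunks
    chunk_scores = chunk_keys @ q / np.sqrt(d)
    selected = np.argsort(chunk_scores)[-top_k:]
    # Gather the token indices belonging to the selected chunks
    idx = np.concatenate(
        [np.arange(c * chunk_size, (c + 1) * chunk_size) for c in selected]
    )
    # Standard scaled dot-product attention, restricted to selected tokens
    scores = k[idx] @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ v[idx]

rng = np.random.default_rng(0)
q = rng.standard_normal(8)
k = rng.standard_normal((16, 8))
v = rng.standard_normal((16, 8))
out = chunk_topk_attention(q, k, v)
print(out.shape)  # (8,)
```

The speedup intuition is that attention cost drops from all 16 tokens to the 8 tokens in the selected chunks; with long sequences and small `top_k`, the savings grow accordingly.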