This article explains how Large Language Models (LLMs) process prompts, from tokenization to response generation. It covers the transformer architecture, including self-attention and feed-forward networks, and details how the KV cache speeds up inference by storing the keys and values of already-processed tokens so they are not recomputed at every decoding step.
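To make the KV-cache idea concrete, here is a minimal sketch of cached decoding with NumPy. The random vectors stand in for learned query/key/value projections; the point is that each step computes attention only for the newest token while reusing the cached keys and values of all earlier tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, K, V):
    # Scaled dot-product attention for one query against all cached keys/values.
    scores = q @ K.T / np.sqrt(q.shape[-1])  # shape (1, tokens_so_far)
    return softmax(scores) @ V               # shape (1, d)

d = 8
rng = np.random.default_rng(0)
K_cache = np.empty((0, d))  # one cached key row per processed token
V_cache = np.empty((0, d))  # one cached value row per processed token

for step in range(4):
    # New token's q/k/v projections (random stand-ins for real projections).
    q, k, v = rng.normal(size=(3, 1, d))
    # Append only this token's key/value; earlier rows are reused, not recomputed.
    K_cache = np.vstack([K_cache, k])
    V_cache = np.vstack([V_cache, v])
    out = attend(q, K_cache, V_cache)
    print(f"step {step}: cache holds {len(K_cache)} tokens, output {out.shape}")
```

Without the cache, every decoding step would recompute keys and values for the entire prefix, which is exactly the quadratic-growth cost the article identifies the KV cache as avoiding.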
The article presents ChunkLLM, a lightweight, pluggable framework designed to improve the inference efficiency of large transformer models. It introduces two key components, the QK Adapter and the Chunk Adapter, which support feature compression and chunk attention acquisition while maintaining strong performance on both long- and short-text benchmarks. Experimental results indicate a significant speedup on long-text processing compared to a vanilla transformer.
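As an illustration of the general chunk-attention idea, the sketch below splits the key sequence into fixed-size chunks, scores each chunk against the query via a mean-pooled summary, and attends only within the top-scoring chunks. The pooling-based selection is an illustrative stand-in, not ChunkLLM's actual trained adapters; function names and parameters here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def chunk_selective_attention(q, K, V, chunk_size=4, top_k=2):
    """Attend only within the top_k chunks whose mean-pooled key summary
    best matches the query. A hypothetical stand-in for learned chunk
    selection such as ChunkLLM's adapters."""
    n, d = K.shape
    n_chunks = n // chunk_size
    # Summarize each chunk by mean-pooling its keys.
    summaries = K[: n_chunks * chunk_size].reshape(n_chunks, chunk_size, d).mean(axis=1)
    # Score chunk summaries against the query; keep the best top_k chunks.
    chunk_scores = (q @ summaries.T).ravel()
    keep = np.sort(np.argsort(chunk_scores)[-top_k:])
    idx = np.concatenate([np.arange(c * chunk_size, (c + 1) * chunk_size) for c in keep])
    # Standard scaled dot-product attention over the selected tokens only.
    scores = q @ K[idx].T / np.sqrt(d)
    return softmax(scores) @ V[idx], keep

rng = np.random.default_rng(1)
K = rng.normal(size=(16, 8))
V = rng.normal(size=(16, 8))
q = rng.normal(size=(1, 8))
out, kept = chunk_selective_attention(q, K, V)
print("attended chunks:", kept, "output shape:", out.shape)
```

The speedup comes from the attention cost scaling with the number of selected tokens (top_k * chunk_size) rather than the full sequence length, which is why the gains grow with context length.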