The article presents ChunkLLM, a lightweight, pluggable framework designed to improve the inference efficiency of large transformer models. It introduces two key components, the QK Adapter and the Chunk Adapter, which support feature compression and chunk attention acquisition while preserving high performance on both long- and short-text benchmarks. Experimental results show a significant speedup on long-text processing compared with standard transformer inference.
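The summary describes chunk-level attention only at a high level, so the following is a minimal sketch of one plausible reading: a low-rank adapter compresses queries and keys, and each fixed-size chunk of context is scored cheaply so the model attends only to the top-scoring chunks. All names here (QKAdapter, select_chunks, bottleneck_dim, chunk_size, top_k) are illustrative assumptions, not the paper's actual API or method.

```python
# Hypothetical sketch of chunk selection via compressed query/key features.
# This is NOT the ChunkLLM implementation; it only illustrates the general idea.
import torch
import torch.nn as nn


class QKAdapter(nn.Module):
    """Low-rank bottleneck that compresses query/key features for cheap chunk scoring (assumed design)."""

    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.q_proj = nn.Linear(hidden_dim, bottleneck_dim, bias=False)
        self.k_proj = nn.Linear(hidden_dim, bottleneck_dim, bias=False)

    def forward(self, queries: torch.Tensor, keys: torch.Tensor):
        # Project both sides into a shared low-dimensional space.
        return self.q_proj(queries), self.k_proj(keys)


def select_chunks(q: torch.Tensor, k: torch.Tensor, chunk_size: int, top_k: int):
    """Score fixed-size key chunks against the current query; keep the top-k.

    q: (bottleneck_dim,) compressed query for the current decoding step.
    k: (seq_len, bottleneck_dim) compressed keys for the full context.
    Returns the indices of the selected chunks.
    """
    seq_len = k.shape[0]
    n_chunks = seq_len // chunk_size
    # Mean-pool each chunk's keys into a single representative vector.
    chunk_reps = k[: n_chunks * chunk_size].reshape(n_chunks, chunk_size, -1).mean(dim=1)
    scores = chunk_reps @ q  # one dot product per chunk, not per token
    return torch.topk(scores, k=min(top_k, n_chunks)).indices


if __name__ == "__main__":
    hidden_dim, seq_len = 512, 4096
    adapter = QKAdapter(hidden_dim)
    queries = torch.randn(1, hidden_dim)
    keys = torch.randn(seq_len, hidden_dim)
    q_c, k_c = adapter(queries, keys)
    chosen = select_chunks(q_c[0], k_c, chunk_size=128, top_k=4)
    print("attend only to chunks:", chosen.tolist())  # e.g. 4 of 32 chunks
```

The intuition for the claimed speedup is that scoring 32 pooled chunk vectors is far cheaper than full attention over 4,096 tokens; full attention is then restricted to the few selected chunks.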