The repository provides an implementation of the method "Learning Compact Vision Tokens for Efficient Large Multimodal Models," which enhances inference efficiency by fusing spatial-adjacent vision tokens and introducing a Multi-Block Token Fusion module. Experimental results show that this approach achieves competitive performance on various vision-language benchmarks while using only 25% of the baseline vision tokens.
multimodal ✓
vision-tokens ✓
+ inference
efficiency ✓
deep-learning ✓