Quit Emailing Yourself

# multimodal → deep-learning → vision-tokens → efficiency

1 link tagged with all of: multimodal + deep-learning + vision-tokens + efficiency

GitHub - visresearch/LLaVA-STF: The official implementation of "Learning Compact Vision Tokens for Efficient Large Multimodal Models"

The repository implements the method from "Learning Compact Vision Tokens for Efficient Large Multimodal Models," which improves inference efficiency by fusing spatially adjacent vision tokens and introducing a Multi-Block Token Fusion module. According to the reported experiments, the approach achieves competitive performance on various vision-language benchmarks while using only 25% of the baseline vision tokens.
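To make the 25% figure concrete, here is a minimal sketch of what fusing spatially adjacent tokens could look like. It is not the repository's code. It assumes a 24×24 grid of vision tokens (576 in total), a 2×2 fusion window, and fusion by plain concatenation of feature vectors; the function name `fuse_adjacent_tokens` and all of its parameters are hypothetical. The Multi-Block Token Fusion module is not covered.

```python
def fuse_adjacent_tokens(tokens, grid_h, grid_w, window=2):
    """Fuse each window x window patch of spatially adjacent tokens into one.

    tokens: row-major list of grid_h * grid_w feature vectors (lists of floats).
    Returns a row-major list of fused tokens. Each fused token is the
    concatenation of the tokens in its window (an assumption made for this
    sketch), so the feature dimension grows by a factor of window**2.
    """
    if len(tokens) != grid_h * grid_w:
        raise ValueError("token count does not match grid size")
    if grid_h % window or grid_w % window:
        raise ValueError("grid dimensions must be divisible by window")

    fused = []
    for top in range(0, grid_h, window):
        for left in range(0, grid_w, window):
            merged = []
            # Visit the window in row-major order and concatenate features.
            for dy in range(window):
                for dx in range(window):
                    merged.extend(tokens[(top + dy) * grid_w + (left + dx)])
            fused.append(merged)
    return fused


# Hypothetical sizes: 24 x 24 grid, 4-dimensional toy features.
grid, dim = 24, 4
tokens = [[float(i)] * dim for i in range(grid * grid)]
fused = fuse_adjacent_tokens(tokens, grid, grid)
print(len(tokens), len(fused), len(fused) / len(tokens))  # 576 144 0.25
```

With a 2×2 window, every four neighboring tokens collapse into one, which is where a 4× reduction in sequence length would come from under these assumptions. A real model would still need to project the wider fused vectors back to the language model's embedding size.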

Saved by tldr-importer · Last saved October 29, 2025 · 3 min read
