5 links tagged with all of: multimodal + deep-learning
Links
Liquid is an auto-regressive model that unifies visual comprehension and generation by tokenizing images into discrete codes and learning them alongside text tokens. This multimodal large language model operates in a single shared feature space, so it can both understand and generate images without relying on external visual embeddings. Liquid is released in multiple sizes and is used to study the scaling laws of multimodal models, showing that understanding and generation tasks benefit each other.
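To make the mechanism described above concrete, here is a minimal sketch of a unified autoregressive model over text tokens and discrete image codes drawn from a shared vocabulary. The vocabulary sizes, the tiny transformer, and the random "VQ" codes are illustrative assumptions, not Liquid's actual architecture.

```python
# Sketch: one decoder-only transformer trained over a mixed sequence of text
# tokens and discrete image codes that share a single vocabulary and embedding
# space. All sizes and modules are placeholders, not Liquid's implementation.
import torch
import torch.nn as nn

TEXT_VOCAB = 32000                    # hypothetical text vocabulary size
IMAGE_CODEBOOK = 8192                 # hypothetical VQ codebook size
VOCAB = TEXT_VOCAB + IMAGE_CODEBOOK   # image codes live in the same vocabulary as text

class UnifiedLM(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)          # shared embedding space
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, VOCAB)           # predicts text *or* image codes

    def forward(self, tokens):
        # causal mask: each position may only attend to earlier tokens
        T = tokens.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.blocks(self.embed(tokens), mask=mask)
        return self.lm_head(h)

# A caption followed by the discrete codes of the image it describes:
text_ids = torch.randint(0, TEXT_VOCAB, (1, 16))
image_codes = torch.randint(TEXT_VOCAB, VOCAB, (1, 64))    # offset into the shared vocab
sequence = torch.cat([text_ids, image_codes], dim=1)

model = UnifiedLM()
logits = model(sequence[:, :-1])                            # next-token prediction
loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB), sequence[:, 1:].reshape(-1)
)
```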
MingTok introduces the first continuous unified tokenizer for vision, enabling image understanding and generation within a single framework. By aligning semantic understanding with generative dynamics in one continuous latent space, it converges 3.5x faster and supports efficient multi-turn interactions without the costly round trips through separate representations seen in previous models. Ming-UniVision, built on MingTok, handles both tasks in a single multimodal model.
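As a rough illustration of what a continuous unified tokenizer interface can look like (the shapes and modules below are assumptions, not MingTok's design), one encoder produces continuous latents that feed both an understanding pathway and a generation/reconstruction pathway, so multi-turn workflows can stay in a single latent space.

```python
# Illustrative sketch of a continuous unified vision tokenizer: the same
# continuous latents serve understanding (features for an LLM) and generation
# (image reconstruction). Shapes are assumptions, not MingTok's architecture.
import torch
import torch.nn as nn

class ContinuousTokenizer(nn.Module):
    def __init__(self, patch=16, d_latent=256):
        super().__init__()
        # patchify + project: (B, 3, H, W) -> continuous features, no discrete codebook
        self.encode = nn.Conv2d(3, d_latent, kernel_size=patch, stride=patch)
        self.decode = nn.ConvTranspose2d(d_latent, 3, kernel_size=patch, stride=patch)

    def tokenize(self, images):
        z = self.encode(images)                      # continuous feature map
        return z.flatten(2).transpose(1, 2)          # (B, N, d_latent) "tokens"

    def reconstruct(self, tokens, hw):
        z = tokens.transpose(1, 2).unflatten(2, hw)  # back to (B, d_latent, h, w)
        return self.decode(z)

tok = ContinuousTokenizer()
imgs = torch.randn(2, 3, 128, 128)
latents = tok.tokenize(imgs)                   # shared latents: (2, 64, 256)
recon = tok.reconstruct(latents, (8, 8))       # generation path reuses the same latents
understanding_features = latents.mean(dim=1)   # e.g. pooled input to an LLM adapter
```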
The paper presents BLIP3-o, a family of fully open unified multimodal models that enhance both image understanding and generation. It introduces a diffusion transformer for generating CLIP image features, advocates for a sequential pretraining strategy, and proposes a high-quality dataset, BLIP3o-60k, to improve performance across various benchmarks. The models, along with code and datasets, are open-sourced to foster further research.
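The piece of BLIP3-o that lends itself to a sketch is diffusion over image feature vectors (CLIP-style embeddings) rather than pixels, conditioned on text. Below is a minimal DDPM-style training step over feature vectors; the dimensions, noise schedule, and tiny MLP denoiser are placeholders standing in for the paper's diffusion transformer.

```python
# Minimal sketch of diffusion over image *feature vectors* instead of pixels:
# noise clean CLIP-like embeddings and train a denoiser to predict the noise,
# conditioned on text features. All sizes and the MLP are illustrative only.
import torch
import torch.nn as nn

D_FEAT, D_COND, T = 768, 512, 1000       # assumed feature dim, text-condition dim, steps
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

denoiser = nn.Sequential(                 # stands in for a transformer denoiser
    nn.Linear(D_FEAT + D_COND + 1, 1024), nn.SiLU(), nn.Linear(1024, D_FEAT)
)

def training_step(clip_feats, text_cond):
    """One DDPM-style step: predict the noise added to clean image features."""
    B = clip_feats.size(0)
    t = torch.randint(0, T, (B,))
    noise = torch.randn_like(clip_feats)
    a = alphas_bar[t].unsqueeze(1)
    noisy = a.sqrt() * clip_feats + (1 - a).sqrt() * noise
    t_embed = (t.float() / T).unsqueeze(1)               # crude timestep embedding
    pred = denoiser(torch.cat([noisy, text_cond, t_embed], dim=1))
    return nn.functional.mse_loss(pred, noise)

loss = training_step(torch.randn(4, D_FEAT), torch.randn(4, D_COND))
```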
HunyuanImage-3.0 has been released as an open-source image generation model with a unified multimodal architecture that integrates text and image understanding. It is billed as the largest open-source image-generation Mixture of Experts model to date, with 80 billion total parameters, and supports extensive customization through multiple checkpoints and performance optimizations.
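For readers unfamiliar with the Mixture of Experts idea referenced above, here is a toy sparse MoE layer (not Hunyuan's routing or expert configuration): a learned gate sends each token to a few experts, so total parameter count can grow far beyond the compute spent per token.

```python
# Toy sparse Mixture-of-Experts layer: a gate picks the top-k experts per token,
# so parameters scale with the number of experts while per-token compute stays
# small. Sizes and top-k are illustrative, not HunyuanImage-3.0's configuration.
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    def __init__(self, d_model=256, n_experts=8, top_k=2):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                              # x: (tokens, d_model)
        scores = self.gate(x).softmax(dim=-1)
        weights, idx = scores.topk(self.top_k, dim=-1) # (tokens, k) each
        out = torch.zeros_like(x)
        for k in range(self.top_k):                    # dense loops for clarity, not speed
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                  # tokens routed to expert e at rank k
                if mask.any():
                    out[mask] += weights[:, k][mask].unsqueeze(1) * expert(x[mask])
        return out

moe = MoELayer()
y = moe(torch.randn(10, 256))                          # each token uses only 2 of 8 experts
```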
The repository provides an implementation of the method "Learning Compact Vision Tokens for Efficient Large Multimodal Models," which improves inference efficiency by fusing spatially adjacent vision tokens and introducing a Multi-Block Token Fusion module. Experimental results show that this approach achieves competitive performance on various vision-language benchmarks while using only 25% of the baseline vision tokens.
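The 25% figure is consistent with fusing each 2x2 neighborhood of vision tokens into one. Below is a hedged sketch of that spatial fusion step, using a simple concatenate-and-project over 2x2 blocks; it is not the repository's actual Multi-Block Token Fusion module.

```python
# Sketch of spatial vision-token fusion: merge each 2x2 block of adjacent
# vision tokens into one token (4x fewer tokens, i.e. 25% of the baseline).
# The concat-and-project fusion is illustrative, not the paper's exact module.
import torch
import torch.nn as nn

class SpatialTokenFusion(nn.Module):
    def __init__(self, d_model=1024, block=2):
        super().__init__()
        self.block = block
        self.proj = nn.Linear(d_model * block * block, d_model)

    def forward(self, tokens, grid_hw):
        B, N, D = tokens.shape
        H, W = grid_hw                                  # e.g. a 24 x 24 patch grid
        b = self.block
        x = tokens.view(B, H, W, D)
        # group each b x b neighborhood, concatenate its tokens, project back to D
        x = x.view(B, H // b, b, W // b, b, D).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(B, (H // b) * (W // b), b * b * D)
        return self.proj(x)

fuse = SpatialTokenFusion()
vision_tokens = torch.randn(1, 576, 1024)               # 24 x 24 grid from a ViT encoder
compact = fuse(vision_tokens, (24, 24))                  # -> (1, 144, 1024): 25% of tokens
```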