Liquid is an auto-regressive model that unifies visual comprehension and generation by tokenizing images into discrete codes and learning them alongside text tokens. The multimodal large language model operates in a single shared feature space, so it can both understand and generate images without relying on external visual embeddings. Liquid is released in multiple sizes and is used to study the scaling laws of multimodal models, showing that the understanding and generation tasks benefit each other.
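To make the shared-vocabulary idea concrete, here is a minimal sketch, not the actual Liquid implementation: discrete image codes from an assumed VQ-style tokenizer are offset into the tail of the text vocabulary, and a single decoder-only transformer is trained with ordinary next-token prediction over the mixed sequence. All names and sizes (`TEXT_VOCAB_SIZE`, `IMAGE_CODEBOOK_SIZE`, `UnifiedLM`) are illustrative assumptions.

```python
# Sketch of unified autoregressive modeling over text tokens and image codes.
# Not the Liquid codebase; vocabulary sizes and model dimensions are assumed.
import torch
import torch.nn as nn

TEXT_VOCAB_SIZE = 32_000      # assumed text vocabulary size
IMAGE_CODEBOOK_SIZE = 8_192   # assumed VQ codebook size
VOCAB_SIZE = TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE  # one shared vocabulary

class UnifiedLM(nn.Module):
    """One decoder-only transformer over both text tokens and image codes."""
    def __init__(self, d_model=512, n_layers=4, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB_SIZE)

    def forward(self, ids):
        x = self.embed(ids)
        # Causal mask: each position attends only to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(ids.size(1))
        return self.head(self.blocks(x, mask=mask))

# A caption followed by its image, both as ids in the shared vocabulary.
text_ids = torch.randint(0, TEXT_VOCAB_SIZE, (1, 16))
image_codes = torch.randint(0, IMAGE_CODEBOOK_SIZE, (1, 64)) + TEXT_VOCAB_SIZE
seq = torch.cat([text_ids, image_codes], dim=1)

model = UnifiedLM()
logits = model(seq[:, :-1])  # predict the next token at every position
loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB_SIZE), seq[:, 1:].reshape(-1)
)
```

Because image codes and text tokens live in the same vocabulary, the same cross-entropy objective covers captioning (text after image codes) and generation (image codes after text), which is the mutual-benefit setup the summary describes.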
FlexTok is a method for resampling images into 1D token sequences of flexible length; official implementations and pre-trained models are available on GitHub. The repository provides installation instructions, usage examples, and model checkpoints, and cautions users to load checkpoints only from trusted sources, since deserializing untrusted checkpoints is a potential security risk. Code snippets and Jupyter notebooks show how to run FlexTok tokenizer and VAE inference in downstream projects, roughly as sketched below.
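The following is a hedged sketch of what a FlexTok tokenize/detokenize round-trip in a downstream project might look like. The class name `FlexTokFromHub`, the checkpoint id, the `tokenize`/`detokenize` method names, and the expected input format are assumptions based on the repository's description, not a verified API; the README and notebooks are the authoritative reference, and checkpoints should only be loaded from trusted sources.

```python
# Assumed usage sketch; exact class, checkpoint, and method names may differ.
import torch
from PIL import Image
from torchvision.transforms.functional import to_tensor

# Hypothetical hub wrapper and checkpoint id (check the README for real ones).
from flextok.flextok_wrapper import FlexTokFromHub

model = FlexTokFromHub.from_pretrained("EPFL-VILAB/flextok_d18_d28_in1k").eval()

# Prepare one image as a (1, 3, 256, 256) batch, assumed to be in [-1, 1].
img = to_tensor(Image.open("example.jpg").convert("RGB").resize((256, 256)))
img = img.unsqueeze(0) * 2 - 1

with torch.no_grad():
    # Encode into a 1D token sequence (assumed to return one tensor per image).
    tokens = model.tokenize(img)
    # Flexible length: keeping only a prefix yields a coarser representation.
    tokens = [t[:, :64] for t in tokens]
    # Decode the (possibly truncated) tokens back into pixels; the real call
    # may take extra arguments such as the number of decoding steps.
    reconstruction = model.detokenize(tokens)
```

The prefix truncation step illustrates the "flexible length" property: the same tokenizer output can be cut to fewer tokens when a cheaper, lower-fidelity representation is acceptable.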