1 link tagged with all of: multimodal + deep-learning + ai + model
Links
Youtu-VL is a 4B-parameter vision-language model that handles both vision-centric and general multimodal tasks without task-specific modules. It uses an autoregressive supervision method to improve visual understanding and preserve fine-grained visual detail. Supported applications range from image classification to visual question answering.