Links
Youtu-VL is a 4B-parameter Vision-Language Model that excels in both vision-centric and general multimodal tasks without needing task-specific modules. It uses a unique autoregressive supervision method to enhance visual understanding and preserve detailed information. The model supports various applications, from image classification to visual question answering.
The olmOCR-2-7B-1025 model is a fine-tuned version of Qwen2.5-VL-7B-Instruct, designed to improve optical character recognition (OCR), especially for complex cases like math equations and tables. The FP8 version is recommended for practical applications, and the model can handle large-scale document processing through the olmOCR toolkit. It demonstrates high performance on various OCR benchmarks.