2 min read | Saved February 14, 2026
Do you care about this?
This article outlines a method for training judge models for Vision-Language Models (VLMs) without human annotations. The approach iteratively self-synthesizes its own training data to improve judgment accuracy, yielding notable gains on multimodal reward benchmarks.
If you do, here's more
Vision-Language Models (VLMs) need effective judges to assess their outputs, but training these judges usually relies on expensive human annotations. This study introduces a framework that self-trains a VLM judge without any human input. The process consists of three main stages: first, it generates a variety of multimodal instruction-response pairs with differing quality levels; next, it produces reasoning traces and judgments for these pairs, filtering out those that don't meet the expected quality standards; finally, it trains the judge on the correct judgments and their corresponding reasoning.
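The three-stage loop can be sketched in miniature. This is a toy simulation under stated assumptions, not the paper's code: the data generator, the judge, and the training step are all stand-in functions, and "training" is modeled abstractly as a small accuracy bump proportional to how much filtered data survives.

```python
import random

def generate_pairs(n):
    """Stage 1 (toy): synthesize instruction-response pairs with a known
    quality ordering -- one 'good' and one deliberately degraded 'bad'
    response per instruction."""
    return [
        {
            "instruction": f"Describe image {i}",
            "good": f"accurate caption {i}",
            "bad": f"hallucinated caption {i}",
        }
        for i in range(n)
    ]

def judge(pair, accuracy):
    """Stage 2 (toy): the current judge emits a reasoning trace and a
    verdict; here it picks the known-better response with probability
    `accuracy`."""
    choice = "good" if random.random() < accuracy else "bad"
    trace = f"Response '{choice}' better matches the image."
    return choice, trace

def self_train(judge_accuracy, rounds=3, n=200):
    """Stage 3 (toy): keep only judgments consistent with the known
    quality ordering, then 'fine-tune' on them -- modeled here as a
    capped accuracy increase proportional to the kept fraction."""
    for _ in range(rounds):
        kept = [p for p in generate_pairs(n)
                if judge(p, judge_accuracy)[0] == "good"]
        judge_accuracy = min(0.95, judge_accuracy + 0.05 * len(kept) / n)
    return judge_accuracy
```

The key idea the sketch preserves is that the filtering step only admits judgments whose verdicts agree with the quality ordering the generator built in, so each round trains on progressively cleaner reasoning traces.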
The researchers evaluated their self-trained judge on Multimodal RewardBench and VL-RewardBench, across metrics including correctness, preference, reasoning, safety, and visual question answering. Accuracy on VL-RewardBench rose from 0.38 to 0.51 with a Llama-3.2-11B model, surpassing even larger models such as Llama-3.2-90B and GPT-4o, with particular strength in reasoning and hallucination detection.
The results indicate a promising direction for developing autonomous judges that keep pace with the rapid evolution of VLMs. This approach could streamline the evaluation process, making it less dependent on human resources while still maintaining or improving effectiveness in assessing model performance.