A new study reveals that state-of-the-art Vision Language Models (VLMs) exhibit severe confirmation bias: they achieve 100% accuracy on familiar, unmodified images but drop to approximately 17% accuracy on counterfactual images, where the visual content contradicts common knowledge. Rather than actually analyzing the image, the models fall back on memorized knowledge, producing a large gap between performance on unmodified and modified images. The research highlights that 75.70% of errors are bias-aligned, meaning the wrong answer matches the memorized prior rather than what the image shows, pointing to a fundamental flaw in how VLMs process multimodal information.
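The evaluation implied by these numbers is straightforward to reproduce in outline. Below is a minimal sketch, not the paper's code: `query_vlm` is a hypothetical callable standing in for any VLM, and each example is assumed to record both the answer the counterfactual image actually supports and the answer memorized world knowledge would suggest. It computes counterfactual accuracy and the share of errors that are bias-aligned.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class CounterfactualExample:
    image_path: str     # counterfactual image, e.g. an object in an unusual color
    question: str       # question posed alongside the image
    visual_answer: str  # answer the modified image actually supports
    prior_answer: str   # answer memorized world knowledge would suggest

def evaluate_bias(
    examples: list[CounterfactualExample],
    query_vlm: Callable[[str, str], str],  # (image_path, question) -> answer
) -> dict[str, float]:
    """Compute accuracy on counterfactual images and the fraction of errors
    that are bias-aligned (i.e., the model answered from its prior)."""
    correct, errors, bias_aligned = 0, 0, 0
    for ex in examples:
        pred = query_vlm(ex.image_path, ex.question).strip().lower()
        if pred == ex.visual_answer.lower():
            correct += 1
        else:
            errors += 1
            if pred == ex.prior_answer.lower():
                bias_aligned += 1
    return {
        "counterfactual_accuracy": correct / len(examples),
        "bias_aligned_error_rate": bias_aligned / errors if errors else 0.0,
    }
```

A bias-aligned error rate near the reported 75.70% would indicate that most mistakes come from the prior overriding the image, not from random misreading.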
Researchers from Meta and The Hebrew University found that shorter reasoning chains in large language models significantly improve accuracy, with answers up to 34.5% more likely to be correct than those drawn from longer chains. The study challenges the conventional belief that more extensive reasoning yields better performance, suggesting that favoring concise chains can cut inference cost and improve results at the same time.
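One simple way to act on this finding at inference time is a shortest-chain selection scheme: sample several reasoning chains in parallel and take a majority vote over the answers from the few shortest ones. The sketch below illustrates that general idea under stated assumptions; it is not the authors' implementation. `generate_chain` is a hypothetical callable returning a (reasoning, answer) pair, and a production version could save compute by halting the remaining generations as soon as the first `m` chains finish.

```python
from collections import Counter
from typing import Callable

def shortest_m_of_k(
    prompt: str,
    generate_chain: Callable[[str], tuple[str, str]],  # prompt -> (reasoning, answer)
    k: int = 8,
    m: int = 3,
) -> str:
    """Sample k reasoning chains and answer by majority vote over the m shortest."""
    samples = [generate_chain(prompt) for _ in range(k)]
    samples.sort(key=lambda s: len(s[0]))  # shortest reasoning first
    votes = Counter(answer for _, answer in samples[:m])
    return votes.most_common(1)[0][0]
```

The design choice here mirrors the study's core observation: length acts as a cheap quality signal, so discarding the longest chains trades no accuracy away and often gains some.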