3 links tagged with all of: artificial-intelligence + evaluation
Links
SpatialScore introduces a comprehensive benchmark for evaluating spatial understanding in multimodal large language models (MLLMs), comprising the VGBench dataset and 28K samples in total. It also presents SpatialAgent, a multi-agent system designed to strengthen spatial reasoning, and its quantitative and qualitative evaluations show both measurable gains and persistent challenges on spatial tasks.
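As a rough illustration of how a benchmark like this is typically consumed (not SpatialScore's actual harness; the file layout and function names here are assumptions), a minimal exact-match scoring loop might look like:

```python
import json

def load_samples(path):
    """Load benchmark samples: each line holds an image path, a question,
    and a ground-truth answer (hypothetical JSONL layout)."""
    with open(path) as f:
        return [json.loads(line) for line in f]

def evaluate(model, samples):
    """Score a model on spatial-understanding QA by exact-match accuracy."""
    correct = 0
    for s in samples:
        # `model` is any callable taking (image_path, question) -> answer string.
        pred = model(s["image"], s["question"])
        correct += pred.strip().lower() == s["answer"].strip().lower()
    return correct / len(samples)
```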
The linked article on evaluating GPT-5 was corrupted or unreadable, so no meaningful summary could be extracted from it.
The study evaluates the capabilities of autonomous web agents built on large language models, revealing a gap between their perceived and actual competence that it attributes to flaws in current benchmarks. It introduces Online-Mind2Web, a new evaluation benchmark comprising 300 tasks across 136 websites, and presents a novel LLM-as-a-Judge method that aligns closely with human assessment. The findings highlight the strengths and limitations of existing web agents to help guide future research.
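In outline, the LLM-as-a-Judge idea is to show a judge model the task description and the agent's trajectory, then ask for a success verdict. Below is a minimal sketch, assuming the OpenAI Python client and an entirely hypothetical prompt format (the paper's actual prompting and rubric differ):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are evaluating a web agent.
Task: {task}
Agent actions and final page state:
{trajectory}
Did the agent complete the task? Answer SUCCESS or FAILURE, then give one sentence of reasoning."""

def judge(task: str, trajectory: str) -> bool:
    """Ask a judge LLM whether the agent's trajectory completed the task."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # hypothetical choice of judge model
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(task=task, trajectory=trajectory)}],
        temperature=0,  # deterministic verdicts make runs comparable
    )
    return resp.choices[0].message.content.strip().upper().startswith("SUCCESS")
```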