This paper studies how unified multimodal models (UMMs) can enhance reasoning by integrating visual generation, and releases an accompanying codebase. It introduces VisWorld-Eval, a new evaluation suite that assesses multimodal reasoning across a range of tasks. Experiments show that interleaved visual-verbal reasoning outperforms purely verbal reasoning in specific contexts.
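To make "interleaved visual-verbal reasoning" concrete, here is a minimal Python sketch of a loop that alternates text steps with generated visual intermediates. The `model` interface (`generate_text`, `generate_image`, `needs_visual`, `is_final`) is hypothetical and not drawn from the VisWorld-Eval codebase; the actual method may differ.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    kind: str       # "text" or "image"
    content: object

@dataclass
class Trace:
    steps: list = field(default_factory=list)

def interleaved_reason(model, question, max_steps=6):
    """Alternate verbal reasoning steps with generated visual intermediates.

    `model` is an assumed unified multimodal model; every method called on it
    here is a placeholder for whatever the real interface provides.
    """
    trace = Trace()
    context = [question]
    thought = None
    for _ in range(max_steps):
        thought = model.generate_text(context)        # verbal reasoning step
        trace.steps.append(Step("text", thought))
        context.append(thought)
        if model.needs_visual(thought):               # hypothetical predicate
            image = model.generate_image(context)     # visual reasoning step
            trace.steps.append(Step("image", image))
            context.append(image)                     # image re-enters context
        if model.is_final(thought):
            break
    return thought, trace
```

The key design point the summary implies is that generated images feed back into the model's context, so later verbal steps can condition on them, rather than images being produced only as final outputs.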
This paper introduces a new benchmark for generative world models (WMs) that evaluates them in closed-loop settings reflecting real agent-environment interaction. It prioritizes task success over visual quality and finds that controllability and effective post-training data scaling are crucial to improving embodied agents' performance. The study thereby establishes a systematic evaluation framework for future research on generative world models.
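The closed-loop aspect can be illustrated with a short sketch: the agent plans against the world model, but its actions execute in the real environment, so prediction errors feed back into later observations, and only task success is scored. The `agent`, `world_model`, and `env` interfaces below are assumptions for illustration, not the benchmark's actual API.

```python
def closed_loop_success_rate(agent, world_model, env, episodes=100, horizon=50):
    """Score a world model by downstream task success, not visual fidelity.

    Assumed interfaces: env.reset() -> obs; env.step(action) -> (obs, done,
    success); agent.plan(obs, world_model) -> action. All are placeholders.
    """
    successes = 0
    for _ in range(episodes):
        obs = env.reset()
        success = False
        for _ in range(horizon):
            # The agent plans by imagining rollouts inside the world model...
            action = agent.plan(obs, world_model)
            # ...but the action is executed in the real environment, so any
            # world-model error compounds across the episode.
            obs, done, success = env.step(action)
            if done:
                break
        successes += int(success)
    return successes / episodes
```

This contrasts with open-loop evaluation, where predicted frames are compared to ground truth without an agent acting on them; a model can score well on frame quality yet still mislead a planner.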