This work introduces a new benchmark for generative world models (WMs) that measures their effectiveness in closed-loop environments reflecting real agent-environment interaction. The benchmark prioritizes task success over visual quality, and its results indicate that controllability and effective scaling of post-training data are crucial for improving embodied agents' performance. The study also establishes a systematic evaluation framework for future research on generative world models.