
Evaluating Gemini Robotics Policies in a Veo World Simulator

4 min read | Saved February 14, 2026

robotics 🤖 simulation 🤖 policy-evaluation 🤖 video-models 🤖 safety 🤖

Do you care about this?

This article describes a generative evaluation system that uses the Veo World Simulator to assess robotics policies. It shows how a video model can predict robot performance across a range of scenarios, including out-of-distribution conditions and safety testing. The system was validated against more than 1,600 real-world evaluations covering eight policy checkpoints.

If you do, here's more

The article presents a generative evaluation system for robotics built on Veo, a video foundation model. The system assesses robot policies across a wide range of conditions, from standard scenarios to out-of-distribution (OOD) environments. Video-based world models can simulate realistic robot interactions, but they typically perform well only in scenarios close to their training data. The authors show that such a model can also be used for OOD evaluation, covering factors such as safety and generalization.

The approach fine-tunes Veo on a large dataset of robotics tasks so that it can predict how a scene will unfold given the current observations and the robot's intended actions. The authors compare the system's predictions against more than 1,600 real-world evaluations spanning various tasks and eight policy checkpoints, and report strong predictive accuracy. For instance, it effectively ranks robot policies on pick-and-place tasks and identifies how different OOD conditions, such as new objects or backgrounds, affect performance. The model also supports "red teaming": probing a policy for potential vulnerabilities without running physical trials.

The evaluation system uses generative image editing to create diverse scene variations for testing OOD generalization. Results indicate that Veo's predictions correlate well with the success rates measured in real-world trials. The article acknowledges open challenges, such as improving multi-view consistency and making simulated physical interactions more realistic. Overall, the work is a step toward more robust evaluation of robot policies in varied and complex environments.
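
To make "correlates well" and "ranks policies" concrete, here is a minimal sketch of how agreement between simulated and real success rates could be checked. It is not the authors' code or metric: the checkpoint names and all numbers are invented for illustration, and Pearson correlation plus a rank comparison are simply one common way to measure this kind of agreement.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def ranking(scores):
    """Return policy names ordered from highest to lowest success rate."""
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical success rates per policy checkpoint (illustrative only,
# not taken from the article).
predicted = {"ckpt_a": 0.42, "ckpt_b": 0.55, "ckpt_c": 0.71, "ckpt_d": 0.63}
real = {"ckpt_a": 0.35, "ckpt_b": 0.60, "ckpt_c": 0.80, "ckpt_d": 0.66}

names = sorted(predicted)
r = pearson([predicted[n] for n in names], [real[n] for n in names])
same_order = ranking(predicted) == ranking(real)

print(f"Pearson r = {r:.3f}")
print(f"Simulated ranking matches real ranking: {same_order}")
```

A high correlation means the simulator tracks absolute success rates; a matching ranking is a weaker but often sufficient property when the goal is only to choose the better of several checkpoints.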
