UniVLA is an approach to generalist robot policy learning that plans in an embodiment-agnostic action space, achieving state-of-the-art results across multiple benchmarks with efficient training. It describes a method for extracting latent actions from cross-embodiment videos and gives guidance on pre-training and fine-tuning models for real-world robot tasks.
GeometryCrafter is a framework that estimates high-fidelity, temporally coherent point maps from open-world videos, supporting 3D/4D reconstruction and depth-based applications. It uses a point map Variational Autoencoder (VAE) to encode and decode point maps, achieving state-of-the-art accuracy and temporal consistency across diverse environments. The approach addresses limitations of traditional video depth estimation methods and provides improved geometric fidelity for downstream tasks.
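The summary does not define a point map. In common usage it is an image-shaped array that stores a 3D coordinate for every pixel. The sketch below only illustrates that data structure by unprojecting a depth map through a pinhole camera model; it is not GeometryCrafter's method, which predicts point maps with a learned model. The function name `depth_to_point_map` and the intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical and chosen for illustration.

```python
import numpy as np


def depth_to_point_map(depth, fx, fy, cx, cy):
    """Unproject an (H, W) depth map into an (H, W, 3) point map.

    Assumes a pinhole camera (an assumption made for this sketch):
    fx, fy are focal lengths in pixels and (cx, cy) is the principal point.
    """
    h, w = depth.shape
    # u holds the column index and v the row index of each pixel, both (H, W).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    # The last axis stores the camera-space (x, y, z) coordinate of each pixel.
    return np.stack([x, y, depth], axis=-1)


# Toy input: a flat surface 2 units in front of the camera.
depth = np.full((4, 6), 2.0)
points = depth_to_point_map(depth, fx=100.0, fy=100.0, cx=3.0, cy=2.0)
```

For a video, applying this per frame would give a sequence of arrays of shape (H, W, 3). The temporal coherence described above concerns how consistently such per-frame maps agree from one frame to the next.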