6 min read | Saved February 14, 2026
Do you care about this?
Saber is a zero-shot framework for reference-to-video generation that relies solely on video-text pairs instead of costly reference image-video-text triplets. It uses masked training with dynamic substitutes to enhance subject integration and generalization across diverse scenarios. The model shows improved performance in generating videos that maintain subject identity while following text prompts.
If you do, here's more
Reference-to-video (R2V) generation has traditionally depended on large datasets of reference-image, video, and text-prompt triplets. Such triplets are expensive to collect and difficult to scale, which limits generalization to unseen subjects. Saber addresses this by training solely on video-text pairs: a masked training strategy selects random video frames, masks them, and uses them as substitutes for reference images. The model thus learns to preserve subject identity without any explicit R2V data, improving both flexibility and scalability.
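The core idea of the masked training strategy can be sketched in a few lines: pick a random frame from the training clip and keep only its foreground, so the masked frame can stand in for a reference image. This is a minimal illustrative sketch, not Saber's actual code; the function and variable names are assumptions.

```python
import numpy as np

def make_pseudo_reference(video, masks, rng):
    """Pick a random frame from the clip and zero out its background,
    so the masked foreground can substitute for a reference image.

    video: (T, H, W, C) uint8 array; masks: (T, H, W) binary foreground masks.
    Names and shapes are illustrative, not taken from Saber's implementation.
    """
    t = rng.integers(0, video.shape[0])        # random frame index
    frame, fg = video[t], masks[t]
    pseudo_ref = frame * fg[..., None]         # keep foreground pixels only
    return t, pseudo_ref

# Toy example: an 8-frame, 16x16 clip with a square "subject" in the middle.
rng = np.random.default_rng(0)
video = rng.integers(0, 256, size=(8, 16, 16, 3), dtype=np.uint8)
masks = np.zeros((8, 16, 16), dtype=np.uint8)
masks[:, 4:12, 4:12] = 1                       # hypothetical subject region
t, ref = make_pseudo_reference(video, masks, rng)
```

During training, `ref` would be fed where a real reference image would otherwise go, which is what lets the model learn subject preservation from plain video-text pairs.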
Saber's architecture employs a tailored attention mechanism that directs the model toward relevant subject features while suppressing background distractions. It applies random affine transformations to augment the mask data, which reduces common video-generation artifacts. Saber also accepts multiple reference images, including several views of the same subject, enabling richer customization of the output video. On benchmarks such as OpenS2V-Eval, Saber outperforms models trained on traditional R2V datasets, demonstrating its zero-shot generalization.
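The mask augmentation mentioned above can be approximated as a random scale-plus-translation warp of the binary mask. The sketch below uses simple inverse nearest-neighbour warping; the parameter names and ranges are assumptions for illustration, not values from the paper.

```python
import numpy as np

def random_affine_mask(mask, rng, max_shift=2, max_scale=0.2):
    """Apply a random isotropic scale and translation (a simple affine
    transform) to a binary mask via inverse nearest-neighbour warping.
    Parameter ranges are illustrative, not Saber's actual settings."""
    h, w = mask.shape
    s = 1.0 + rng.uniform(-max_scale, max_scale)              # random scale
    ty, tx = rng.integers(-max_shift, max_shift + 1, size=2)  # random shift
    ys, xs = np.mgrid[0:h, 0:w]
    # Inverse-map each output pixel back to its source pixel about the centre.
    src_y = ((ys - h / 2) / s + h / 2 - ty).round().astype(int)
    src_x = ((xs - w / 2) / s + w / 2 - tx).round().astype(int)
    valid = (src_y >= 0) & (src_y < h) & (src_x >= 0) & (src_x < w)
    out = np.zeros_like(mask)
    out[valid] = mask[src_y[valid], src_x[valid]]
    return out

# Toy example: jitter a square mask.
rng = np.random.default_rng(1)
mask = np.zeros((16, 16), dtype=np.uint8)
mask[4:12, 4:12] = 1
warped = random_affine_mask(mask, rng)
```

Jittering the masks this way prevents the model from overfitting to exact mask boundaries, which is one plausible reason such augmentation reduces copy-paste-style artifacts.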
During inference, the model extracts foreground masks from the reference images, enabling it to handle both foreground subjects and background scenes. Reference images are resized and padded to the target video size while preserving their aspect ratios. Qualitative comparisons show that Saber consistently delivers superior subject preservation and video quality across scenarios compared to methods such as Kling1.6, Phantom, and VACE, highlighting Saber's potential for applications where maintaining subject integrity is critical.
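The resize-and-pad preprocessing step can be sketched as follows: scale the reference image to fit inside the target resolution without distorting it, then centre it on a zero-padded canvas. This is a minimal sketch using nearest-neighbour sampling; a real pipeline would use proper interpolation, and the function name is an assumption.

```python
import numpy as np

def resize_and_pad(img, target_h, target_w):
    """Resize an image to fit inside (target_h, target_w) while preserving
    its aspect ratio, then centre it on a zero-padded canvas.
    Nearest-neighbour sampling keeps the sketch dependency-free."""
    h, w = img.shape[:2]
    scale = min(target_h / h, target_w / w)       # fit inside the target box
    new_h = max(1, round(h * scale))
    new_w = max(1, round(w * scale))
    ys = (np.arange(new_h) * h / new_h).astype(int)   # sampling grid rows
    xs = (np.arange(new_w) * w / new_w).astype(int)   # sampling grid cols
    resized = img[ys][:, xs]
    canvas = np.zeros((target_h, target_w) + img.shape[2:], dtype=img.dtype)
    top = (target_h - new_h) // 2
    left = (target_w - new_w) // 2
    canvas[top:top + new_h, left:left + new_w] = resized
    return canvas

# Toy example: a wide 10x20 image padded into a square 16x16 target.
img = np.ones((10, 20, 3), dtype=np.uint8)
canvas = resize_and_pad(img, 16, 16)
```

Preserving the aspect ratio here matters because stretching the reference image would distort the very subject features the model is supposed to carry into the generated video.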