The article presents the Decoupled Diffusion Transformer (DDT) architecture, demonstrating improved performance with a larger encoder in a diffusion model framework. It achieves state-of-the-art FID scores on ImageNet benchmarks and allows for accelerated inference by reusing encoders across steps. The implementation provides detailed configurations for training and inference, along with online demos.
diffusion ✓
+ transformer
image-generation ✓
deep-learning ✓
benchmarks ✓