Representation Autoencoders (RAEs) improve diffusion transformers by replacing the standard SD-VAE with a frozen pretrained representation encoder paired with a lightweight trained decoder, yielding better image generation than the SD-VAE baseline. The study finds that RAEs reconstruct images with high fidelity, and that for good generation quality the diffusion transformer's width must match or exceed the encoder's token dimension. Building on this, the proposed DiTDH variant is both more compute-efficient and more effective, setting new state-of-the-art results on image generation benchmarks.
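As a rough illustration of this setup and of the width constraint, the sketch below pairs a frozen pretrained encoder with a small trainable decoder and checks that a diffusion transformer's hidden width is at least the encoder's token dimension. All module names, dimensions, and the decoder design here are hypothetical placeholders, not the paper's released implementation.

```python
import torch
import torch.nn as nn

# Hypothetical token dimension for a ViT-style pretrained encoder (assumption, not from the paper).
ENC_TOKEN_DIM = 768


class LightweightDecoder(nn.Module):
    """Small decoder mapping latent tokens back to image patches (illustrative only)."""

    def __init__(self, token_dim: int, patch_size: int = 16, out_channels: int = 3):
        super().__init__()
        self.proj = nn.Linear(token_dim, patch_size * patch_size * out_channels)
        self.patch_size = patch_size
        self.out_channels = out_channels

    def forward(self, tokens: torch.Tensor, grid: int) -> torch.Tensor:
        # tokens: (B, N, token_dim) -> image: (B, C, grid*patch, grid*patch)
        b, n, _ = tokens.shape
        p, c = self.patch_size, self.out_channels
        patches = self.proj(tokens).view(b, grid, grid, p, p, c)
        return patches.permute(0, 5, 1, 3, 2, 4).reshape(b, c, grid * p, grid * p)


class RepresentationAutoencoder(nn.Module):
    """Frozen pretrained encoder + lightweight trainable decoder (a sketch, not the released code)."""

    def __init__(self, encoder: nn.Module, token_dim: int = ENC_TOKEN_DIM):
        super().__init__()
        self.encoder = encoder.eval()
        for param in self.encoder.parameters():
            param.requires_grad_(False)  # the representation encoder stays frozen
        self.decoder = LightweightDecoder(token_dim)

    @torch.no_grad()
    def encode(self, images: torch.Tensor) -> torch.Tensor:
        return self.encoder(images)  # (B, N, token_dim) latent tokens

    def decode(self, tokens: torch.Tensor, grid: int) -> torch.Tensor:
        return self.decoder(tokens, grid)


def check_dit_width(dit_hidden_dim: int, token_dim: int = ENC_TOKEN_DIM) -> None:
    # The width constraint from the summary: the diffusion transformer's hidden
    # width should be at least the encoder's token dimension.
    assert dit_hidden_dim >= token_dim, (
        f"DiT width {dit_hidden_dim} is smaller than the latent token dim {token_dim}"
    )


if __name__ == "__main__":
    check_dit_width(dit_hidden_dim=1152)  # a DiT-XL-sized width would satisfy the check
```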
Tags: representation-autoencoders, diffusion-transformers, image-generation, neural-networks, machine-learning