REPA-E introduces a family of end-to-end tuned Variational Autoencoders (E2E-VAEs) that improve both generation quality and training efficiency for text-to-image (T2I) models. The method makes joint training of the VAE and the diffusion model effective, achieves state-of-the-art performance on ImageNet, and improves latent space structure across a range of VAE architectures. The reported results show faster improvement in generation performance and better image quality, suggesting that E2E-VAEs can replace conventional VAEs.