U-REPA: Aligning Diffusion U-Nets to ViTs

Representation Alignment (REPA), which aligns Diffusion Transformer (DiT) hidden states with features from ViT visual encoders, has proven highly effective in DiT training, demonstrating superior convergence properties. However, it has not been validated on the canonical diffusion U-Net architecture, which converges faster than DiTs. Adapting REPA to U-Net architectures presents unique challenges: (1) different block functionalities necessitate revised alignment strategies; (2) spatial-dimension inconsistencies arise from the U-Net's spatial downsampling operations; (3) the space gap between U-Net and ViT features hinders tokenwise alignment. To address these challenges, we propose U-REPA, a representation alignment paradigm that bridges U-Net hidden states and ViT features as follows. First, we observe that, owing to skip connections, the middle stage of the U-Net is the best alignment option. Second, we upsample U-Net features after passing them through MLPs. Third, since tokenwise similarity alignment alone proves difficult, we further introduce a manifold loss that regularizes the relative similarity between samples. Experiments indicate that the resulting U-REPA achieves excellent generation quality and greatly accelerates convergence. With a CFG guidance interval, U-REPA reaches FID<1.5 in 200 epochs or 1M iterations on ImageNet 256×256, and needs only half the total epochs to perform better than REPA. Codes are available at this https URL.
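
The second step, upsampling U-Net features after an MLP projection, can be pictured with a minimal PyTorch sketch. Everything below (the class name UNetFeatureProjector, the two-layer MLP with SiLU, the hidden width, and bilinear interpolation to the ViT token grid) is an assumption for illustration, not the paper's released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UNetFeatureProjector(nn.Module):
    """Hypothetical sketch: project U-Net mid-stage features with an MLP,
    then upsample the spatial grid to match the ViT token grid."""

    def __init__(self, unet_dim: int, vit_dim: int, hidden_dim: int = 2048):
        super().__init__()
        # MLP applied per spatial location, in the style of REPA projectors.
        self.mlp = nn.Sequential(
            nn.Linear(unet_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, vit_dim),
        )

    def forward(self, h: torch.Tensor, vit_grid: int) -> torch.Tensor:
        # h: (B, C, H, W) U-Net mid-stage feature map.
        B, C, H, W = h.shape
        # Apply the MLP tokenwise (one token per spatial location) first ...
        tokens = h.flatten(2).transpose(1, 2)            # (B, H*W, C)
        tokens = self.mlp(tokens)                        # (B, H*W, vit_dim)
        # ... then upsample the projected grid to the ViT token resolution,
        # resolving the spatial-dimension mismatch from U-Net downsampling.
        grid = tokens.transpose(1, 2).reshape(B, -1, H, W)
        grid = F.interpolate(grid, size=(vit_grid, vit_grid),
                             mode="bilinear", align_corners=False)
        return grid.flatten(2).transpose(1, 2)           # (B, vit_grid**2, vit_dim)
```

Projecting before upsampling (rather than after) matches the abstract's phrasing of "upsampling of U-Net features after passing them through MLPs"; the interpolation mode is a guess.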
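Likewise, here is a hedged sketch of the third step, a manifold loss over relative sample similarities, shown next to a plain REPA-style tokenwise loss for contrast. The mean-pooling of tokens, the MSE over batch similarity matrices, and the function names are assumptions; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def manifold_alignment_loss(h: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Assumed sketch of a manifold loss: instead of matching tokens directly,
    match the sample-to-sample similarity structure of the two feature spaces.
    h: (B, N, D) projected U-Net features; v: (B, N, D) ViT features."""
    # Pool tokens to one vector per sample, then L2-normalize.
    h_pooled = F.normalize(h.mean(dim=1), dim=-1)        # (B, D)
    v_pooled = F.normalize(v.mean(dim=1), dim=-1)        # (B, D)
    # Relative (cosine) similarity between samples within the batch.
    sim_h = h_pooled @ h_pooled.t()                      # (B, B)
    sim_v = v_pooled @ v_pooled.t()                      # (B, B)
    # Regularize the U-Net similarity manifold toward the ViT manifold.
    return F.mse_loss(sim_h, sim_v)

def tokenwise_repa_loss(h: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """REPA-style tokenwise alignment: negative cosine similarity per token,
    which the paper reports is hard to optimize across the U-Net/ViT gap."""
    return -F.cosine_similarity(h, v, dim=-1).mean()
```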
@article{tian2025_2503.18414,
  title   = {U-REPA: Aligning Diffusion U-Nets to ViTs},
  author  = {Yuchuan Tian and Hanting Chen and Mengyu Zheng and Yuchen Liang and Chao Xu and Yunhe Wang},
  journal = {arXiv preprint arXiv:2503.18414},
  year    = {2025}
}