
PiT: Progressive Diffusion Transformer

14 pages (main) + 2 pages (bibliography), 9 figures, 9 tables
Abstract

Diffusion Transformers (DiTs) achieve remarkable performance in image generation through the transformer architecture. Conventionally, DiTs are constructed by stacking serial, isotropic, global-modeling transformer blocks, which incur significant quadratic computational cost. However, through empirical analysis, we find that DiTs do not rely on global information as heavily as previously believed; in fact, most layers exhibit significant redundancy in global computation. Additionally, conventional attention mechanisms suffer from low-frequency inertia, limiting their efficiency. To address these issues, we propose Pseudo Shifted Window Attention (PSWA), which fundamentally mitigates global attention redundancy. PSWA achieves a balanced mix of global and local information through window attention, and further employs a high-frequency bridging branch to simulate shifted-window operations, enriching high-frequency information and strengthening inter-window connections. Furthermore, we propose the Progressive Coverage Channel Allocation (PCCA) strategy, which captures high-order attention without additional computational cost. Based on these innovations, we propose a series of Pseudo Progressive Diffusion Transformers (PiT). Extensive experiments demonstrate their superior performance; for example, our proposed PiT-L achieves a 54% FID improvement over DiT-XL/2 while using less computation.
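To make the PSWA idea concrete, the sketch below combines non-overlapping window self-attention with a lightweight high-frequency bridging branch whose receptive field crosses window borders. It is a minimal illustration of the mechanism described in the abstract, not the authors' implementation: the module name, window_size, the depthwise-convolution bridge, and the residual fusion are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class PseudoShiftedWindowAttention(nn.Module):
    """Hypothetical sketch: window attention + high-frequency bridging branch."""

    def __init__(self, dim, num_heads=8, window_size=4):
        super().__init__()
        self.window_size = window_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Assumed high-frequency bridge: a depthwise conv applied on the full grid,
        # standing in for an explicit shifted-window pass across window borders.
        self.bridge = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, h, w):
        # x: (B, h*w, C) token sequence on an h x w latent grid
        B, N, C = x.shape
        ws = self.window_size

        # Partition tokens into non-overlapping ws x ws windows and attend locally.
        xw = x.view(B, h // ws, ws, w // ws, ws, C)
        xw = xw.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)
        attn_out, _ = self.attn(xw, xw, xw)

        # Undo the window partition back to the (B, N, C) token layout.
        attn_out = attn_out.reshape(B, h // ws, w // ws, ws, ws, C)
        attn_out = attn_out.permute(0, 1, 3, 2, 4, 5).reshape(B, N, C)

        # Bridging branch on the full grid: its kernel crosses window boundaries.
        grid = x.transpose(1, 2).reshape(B, C, h, w)
        bridge_out = self.bridge(grid).flatten(2).transpose(1, 2)

        return self.norm(x + attn_out + bridge_out)


# Example usage on a 16x16 latent grid with 384 channels (illustrative sizes).
x = torch.randn(2, 16 * 16, 384)
block = PseudoShiftedWindowAttention(dim=384, num_heads=8, window_size=4)
y = block(x, h=16, w=16)  # (2, 256, 384)
```

In this sketch the attention cost scales with the window size rather than the full token count, while the convolutional branch provides the inter-window, high-frequency pathway; how PiT actually balances the two, and how PCCA allocates channels across layers, is specified in the paper itself.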
