Faster Diffusion via Temporal Attention Decomposition

Abstract

We explore the role of the attention mechanism during inference in text-conditional diffusion models. Empirical observations suggest that cross-attention outputs converge to a fixed point after several inference steps. The convergence time naturally divides the entire inference process into two phases: an initial phase for planning text-oriented visual semantics, which are then translated into images in a subsequent fidelity-improving phase. Cross-attention is essential in the initial phase but almost irrelevant thereafter. Self-attention, by contrast, initially plays a minor role but becomes crucial in the second phase. These findings yield a simple, training-free method known as temporally gating the attention (TGATE), which efficiently generates images by caching and reusing attention outputs at scheduled time steps. Experimental results show that, when applied to various existing text-conditional diffusion models, TGATE accelerates them by 10%-50%. The code of TGATE is available at this https URL.
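The caching idea described in the abstract can be illustrated with a minimal sketch. This is a hypothetical NumPy illustration, not the authors' implementation: the class name `TGateCrossAttention`, the parameter `gate_step`, and the plain scaled dot-product attention are all assumptions made for clarity. Before `gate_step`, cross-attention is computed normally; at `gate_step`, the output is cached; all later steps reuse the cache instead of recomputing attention against the text tokens.

```python
import numpy as np

class TGateCrossAttention:
    """Hypothetical sketch of TGATE-style temporal gating of cross-attention.

    Steps before `gate_step` compute attention normally; the output at
    `gate_step` is cached and reused for every subsequent step, skipping
    the attention computation in the fidelity-improving phase.
    """

    def __init__(self, gate_step: int):
        self.gate_step = gate_step
        self.cache = None

    def _attention(self, q, k, v):
        # Standard scaled dot-product attention (softmax over key axis).
        scores = q @ k.T / np.sqrt(q.shape[-1])
        scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ v

    def __call__(self, step, q, k_text, v_text):
        if step < self.gate_step:
            # Semantics-planning phase: compute cross-attention as usual.
            return self._attention(q, k_text, v_text)
        if self.cache is None or step == self.gate_step:
            # Gate step: compute once more and cache the output.
            self.cache = self._attention(q, k_text, v_text)
        # Fidelity-improving phase: reuse the cached output.
        return self.cache
```

The speedup comes from the reuse branch: after the gate step, the quadratic attention computation is replaced by a cache lookup, which is what makes the method training-free.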

@article{liu2025_2404.02747,
  title={Faster Diffusion via Temporal Attention Decomposition},
  author={Haozhe Liu and Wentian Zhang and Jinheng Xie and Francesco Faccio and Mengmeng Xu and Tao Xiang and Mike Zheng Shou and Juan-Manuel Perez-Rua and Jürgen Schmidhuber},
  journal={arXiv preprint arXiv:2404.02747},
  year={2025}
}