
HarmoniCa: Harmonizing Training and Inference for Better Feature Caching in Diffusion Transformer Acceleration

Main: 9 pages
Appendix: 10 pages
Bibliography: 5 pages
17 figures
23 tables
Abstract

Diffusion Transformers (DiTs) excel in generative tasks but face practical deployment challenges due to high inference costs. Feature caching, which stores and retrieves redundant computations, offers the potential for acceleration. Existing learning-based caching, though adaptive, overlooks the impact of the prior timestep. It also suffers from misaligned objectives between training and inference: training aligns predicted noise, whereas inference targets high-quality images. These two discrepancies compromise both performance and efficiency. To this end, we harmonize training and inference with a novel learning-based caching framework dubbed HarmoniCa. It first incorporates Step-Wise Denoising Training (SDT) to ensure the continuity of the denoising process, where prior steps can be leveraged. In addition, an Image Error Proxy-Guided Objective (IEPO) is applied to balance image quality against cache utilization through an efficient proxy that approximates the image error. Extensive experiments across 8 models, 4 samplers, and resolutions from 256×256 to 2K demonstrate the superior performance and speedup of our framework. For instance, it achieves over 40% latency reduction (i.e., 2.07× theoretical speedup) and improved performance on PixArt-α. Remarkably, our image-free approach reduces training time by 25% compared with the previous method. Our code is available at this https URL.
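To make the core idea of feature caching concrete, here is a minimal toy sketch of a denoising loop that reuses a cached block output on designated steps instead of recomputing it. The function names, the per-step boolean cache schedule, and the reuse rule are illustrative assumptions for exposition, not the paper's actual HarmoniCa implementation (which learns the caching decisions).

```python
def denoise(num_steps, use_cache, compute_block):
    """Toy denoising loop with feature caching.

    use_cache[t] == True means: reuse the most recently cached feature
    at step t instead of paying for the expensive transformer block.
    Returns the per-step features and how many steps actually recomputed.
    (Illustrative sketch only; not the paper's method.)
    """
    cache = None
    recomputed = 0
    outputs = []
    for t in range(num_steps):
        if use_cache[t] and cache is not None:
            feat = cache          # reuse stored feature: computation skipped
        else:
            feat = compute_block(t)  # stand-in for the costly DiT block
            cache = feat             # store for potential reuse later
            recomputed += 1
        outputs.append(feat)
    return outputs, recomputed
```

With a schedule that caches every other step, only half of the steps pay the block cost, which is the source of the latency reduction; the quality/speed trade-off then hinges on choosing which steps can safely reuse stale features.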
