ResearchTrend.AI
EAM: Enhancing Anything with Diffusion Transformers for Blind Super-Resolution

8 May 2025
Haizhen Xie
Kunpeng Du
Qiangyu Yan
Sen Lu
Jianhong Han
Hanting Chen
Hailin Hu
Jie Hu
Abstract

Using pre-trained Text-to-Image (T2I) diffusion models to guide Blind Super-Resolution (BSR) has become a predominant approach in the field. While T2I models have traditionally relied on U-Net architectures, recent work has shown that Diffusion Transformers (DiT) achieve significantly higher performance in this domain. In this work, we introduce the Enhancing Anything Model (EAM), a novel BSR method that leverages DiT and outperforms previous U-Net-based approaches. We propose a novel block, Ψ-DiT, which effectively guides the DiT toward image restoration. This block employs the low-resolution latent as a separable flow-injection control, forming a triple-flow architecture that exploits the prior knowledge embedded in the pre-trained DiT. To fully leverage the prior-guidance capabilities of T2I models and improve their generalization in BSR, we introduce a progressive Masked Image Modeling strategy, which also reduces training costs. Additionally, we propose a subject-aware prompt generation strategy that employs a robust multi-modal model in an in-context learning framework; it automatically identifies key image regions, provides detailed descriptions, and optimizes the use of T2I diffusion priors. Our experiments demonstrate that EAM achieves state-of-the-art results across multiple datasets, outperforming existing methods in both quantitative metrics and visual quality.
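The "separable flow injection" idea in the abstract can be illustrated with a minimal sketch: the low-resolution latent is routed through its own projection and added to the main noisy-latent flow, rather than concatenated with it, with text conditioning as a third flow. All names, shapes, and the combination rule below are illustrative assumptions for a toy numpy example, not the paper's actual Ψ-DiT implementation.

```python
import numpy as np

def linear(x, w, b):
    """Toy affine projection standing in for a DiT sub-layer."""
    return x @ w + b

rng = np.random.default_rng(0)
d = 8  # toy hidden dimension (hypothetical)

# Three flows (hypothetical stand-ins for the paper's triple-flow design):
z_noisy = rng.standard_normal((4, d))   # tokens of the noisy latent (main flow)
z_lowres = rng.standard_normal((4, d))  # tokens of the low-res latent (control flow)
c_text = rng.standard_normal((1, d))    # pooled text embedding (conditioning flow)

# Separable injection: the control flow has its own projection, so the
# pre-trained main-flow weights can stay frozen while the control path trains.
W_main, b_main = 0.1 * rng.standard_normal((d, d)), np.zeros(d)
W_ctrl, b_ctrl = 0.1 * rng.standard_normal((d, d)), np.zeros(d)

# Combine the three flows additively (c_text broadcasts over the 4 tokens).
h = linear(z_noisy, W_main, b_main) + linear(z_lowres, W_ctrl, b_ctrl) + c_text
print(h.shape)  # token count and hidden dim are preserved
```

The additive, separately parameterized control path is one common way low-resolution guidance is injected into frozen diffusion backbones; whether EAM combines its flows exactly this way is not specified in the abstract.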

@article{xie2025_2505.05209,
  title={EAM: Enhancing Anything with Diffusion Transformers for Blind Super-Resolution},
  author={Haizhen Xie and Kunpeng Du and Qiangyu Yan and Sen Lu and Jianhong Han and Hanting Chen and Hailin Hu and Jie Hu},
  journal={arXiv preprint arXiv:2505.05209},
  year={2025}
}