arXiv:2405.08748

Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding

14 May 2024
Zhimin Li
Jianwei Zhang
Qin Lin
Jiangfeng Xiong
Yanxin Long
Xinchi Deng
Yingfang Zhang
Xingchao Liu
Minbin Huang
Zedong Xiao
Dayou Chen
Jiajun He
Jiahao Li
Wenyue Li
Chen Zhang
Rongwei Quan
Jianxiang Lu
Jiabin Huang
Xiaoyan Yuan
Xiao-Ting Zheng
Yixuan Li
Jihong Zhang
Chao Zhang
Mengxi Chen
Jie Liu
Zheng Fang
Weiyan Wang
J. Xue
Yang-Dan Tao
Jianchen Zhu
Kai Liu
Si-Da Lin
Yifu Sun
Yun Li
Dongdong Wang
Ming-Dao Chen
Zhichao Hu
Xiao Xiao
Yan Chen
Yuhong Liu
Wei Liu
Dingyong Wang
Yong Yang
Jie Jiang
Qinglin Lu
Abstract

We present Hunyuan-DiT, a text-to-image diffusion transformer with fine-grained understanding of both English and Chinese. To construct Hunyuan-DiT, we carefully design the transformer structure, text encoder, and positional encoding. We also build a complete data pipeline from scratch to update and evaluate data for iterative model optimization. For fine-grained language understanding, we train a Multimodal Large Language Model to refine the image captions. Finally, Hunyuan-DiT can perform multi-turn multimodal dialogue with users, generating and refining images according to the context. In our holistic human evaluation protocol with more than 50 professional human evaluators, Hunyuan-DiT sets a new state of the art in Chinese-to-image generation among open-source models. Code and pretrained models are publicly available at github.com/Tencent/HunyuanDiT.
