arXiv:2509.20427 (v2, latest)
Seedream 4.0: Toward Next-generation Multimodal Image Generation

24 September 2025
Team Seedream
Yunpeng Chen
Yu Gao
Lixue Gong
Meng Guo
Qiushan Guo
Zhiyao Guo
Xiaoxia Hou
Weilin Huang
Y. Huang
Xiaowen Jian
Huafeng Kuang
Zhichao Lai
Fanshi Li
Liang Li
Xiaochen Lian
Chao Liao
Liyang Liu
Wei Liu
Yanzuo Lu
Zhengxiong Luo
Tongtong Ou
Guang Shi
Yichun Shi
Shiqi Sun
Yu Tian
Zhi Tian
Peng Wang
Rui Wang
Xun Wang
Y. Wang
Guofeng Wu
J. Wu
Wenxu Wu
Yonghui Wu
Xin Xia
Xuefeng Xiao
Shuang Xu
Xin Yan
Ceyuan Yang
Jianchao Yang
Zhonghua Zhai
C. Zhang
H. Zhang
Qi Zhang
Xinyu Zhang
Y. Zhang
Shijia Zhao
Wenliang Zhao
Wenjia Zhu
Topics: MLLM, VLM
Links: arXiv (abs) · PDF · HTML · HuggingFace (68 upvotes) · GitHub (24,387★)
Main: 16 pages
Figures: 15
Bibliography: 2 pages
Appendix: 1 page
Abstract

We introduce Seedream 4.0, an efficient and high-performance multimodal image generation system that unifies text-to-image (T2I) synthesis, image editing, and multi-image composition within a single framework. We develop a highly efficient diffusion transformer with a powerful VAE that considerably reduces the number of image tokens. This allows for efficient training of our model and enables it to quickly generate native high-resolution images (e.g., 1K-4K). Seedream 4.0 is pretrained on billions of text-image pairs spanning diverse taxonomies and knowledge-centric concepts. Comprehensive data collection across hundreds of vertical scenarios, coupled with optimized strategies, ensures stable, large-scale training with strong generalization. By incorporating a carefully fine-tuned VLM, we perform multimodal post-training to train the T2I and image-editing tasks jointly. For inference acceleration, we integrate adversarial distillation, distribution matching, quantization, and speculative decoding, achieving an inference time of up to 1.8 seconds for generating a 2K image (without an LLM/VLM as the PE model). Comprehensive evaluations show that Seedream 4.0 achieves state-of-the-art results on both T2I and multimodal image editing. In particular, it demonstrates exceptional multimodal capabilities in complex tasks, including precise image editing and in-context reasoning, supports multi-image reference, and can generate multiple output images. This extends traditional T2I systems into a more interactive and multidimensional creative tool, pushing the boundary of generative AI for both creativity and professional applications. Seedream 4.0 is now accessible on this https URL.
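
The abstract attributes fast native high-resolution generation largely to a VAE that cuts the number of image tokens the diffusion transformer must attend over. As a rough illustration of that effect only, the short Python sketch below compares token counts under two hypothetical VAE downsampling factors and a hypothetical patch size; the function name, the 8x/16x factors, and the patch size of 2 are illustrative assumptions, not values disclosed in the paper.

    # Back-of-the-envelope sketch of why a stronger VAE shrinks the diffusion
    # transformer's workload. The downsampling factors and patch size are
    # illustrative assumptions, not Seedream 4.0's actual configuration.

    def image_token_count(height: int, width: int, vae_downsample: int, patch_size: int) -> int:
        """Number of latent tokens a diffusion transformer processes for one image.

        The VAE compresses each spatial dimension by `vae_downsample`, and the
        transformer then groups latent pixels into patch_size x patch_size
        patches, each becoming one token.
        """
        latent_h = height // vae_downsample
        latent_w = width // vae_downsample
        return (latent_h // patch_size) * (latent_w // patch_size)

    if __name__ == "__main__":
        for resolution in (1024, 2048, 4096):  # roughly 1K, 2K, and 4K square images
            baseline = image_token_count(resolution, resolution, vae_downsample=8, patch_size=2)
            aggressive = image_token_count(resolution, resolution, vae_downsample=16, patch_size=2)
            print(f"{resolution}px: {baseline} tokens at 8x VAE vs {aggressive} at 16x "
                  f"({baseline // aggressive}x fewer)")

Because self-attention cost grows quadratically with token count, even the 4x token reduction in this toy comparison translates into a much larger saving in compute, which is consistent with the abstract's claim that the VAE enables efficient training and fast 1K-4K generation.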
