Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs

International Conference on Machine Learning (ICML), 2024
22 January 2024
Ling Yang
Zhaochen Yu
Chenlin Meng
Minkai Xu
Stefano Ermon
Tengjiao Wang
CoGe, DiffM
ArXiv (abs) | PDF | HTML | HuggingFace (31 upvotes) | Github (1802★)

Papers citing "Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs"

50 / 139 papers shown
LAMIC: Layout-Aware Multi-Image Composition via Scalability of Multimodal Diffusion Transformer
Yuzhuo Chen
Zehua Ma
Jianhua Wang
Kai Kang
Shunyu Yao
Weiming Zhang
VLM
206
3
0
24 Dec 2025
EditThinker: Unlocking Iterative Reasoning for Any Image Editor
Hongyu Li
Manyuan Zhang
Dian Zheng
Ziyu Guo
Yimeng Jia
...
Peng Pei
Xunliang Cai
Linjiang Huang
Hongsheng Li
Si Liu
DiffM, LRM
248
3
0
05 Dec 2025
Synthetic Curriculum Reinforces Compositional Text-to-Image Generation
Shijian Wang
Runhao Fu
Siyi Zhao
Qingqin Zhan
Xingjian Wang
Jiarui Jin
Yuan Lu
Hanqian Wu
Cunjian Chen
EGVM
242
0
0
23 Nov 2025
Planning with Sketch-Guided Verification for Physics-Aware Video Generation
Yidong Huang
Zun Wang
Han Lin
Dong-Ki Kim
Shayegan Omidshafiei
Jaehong Yoon
Yue Zhang
Mohit Bansal
VGen
267
1
0
21 Nov 2025
Laytrol: Preserving Pretrained Knowledge in Layout Control for Multimodal Diffusion Transformers
Sida Huang
Siqi Huang
Ping Luo
Hongyuan Zhang
DiffM
313
4
0
11 Nov 2025
Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback
Zongjian Li
Zheyuan Liu
Qihui Zhang
Bin Lin
Feize Wu
...
Wangbo Yu
Yuwei Niu
Shaodong Wang
Xinhua Cheng
Li Yuan
409
21
0
19 Oct 2025
Seeing Through the Brain: New Insights from Decoding Visual Stimuli with fMRI
Zheng Huang
Enpei Zhang
Yinghao Cai
Weikang Qiu
Carl Yang
Elynn Chen
Xiang Zhang
Rex Ying
Dawei Zhou
Yujun Yan
DiffM
131
0
0
17 Oct 2025
DeLeaker: Dynamic Inference-Time Reweighting For Semantic Leakage Mitigation in Text-to-Image Models
Mor Ventura
Michael Toker
Or Patashnik
Yonatan Belinkov
Roi Reichart
216
0
0
16 Oct 2025
A Black-Box Debiasing Framework for Conditional Sampling
Han Cui
Jingbo Liu
95
0
0
13 Oct 2025
TTOM: Test-Time Optimization and Memorization for Compositional Video Generation
Leigang Qu
Ziyang Wang
Na Zheng
Wenjie Wang
Liqiang Nie
Tat-Seng Chua
181
2
0
09 Oct 2025
Stitch: Training-Free Position Control in Multimodal Diffusion Transformers
Jessica Bader
Mateusz Pach
Maria A. Bravo
Serge Belongie
Zeynep Akata
162
1
0
30 Sep 2025
Dragging with Geometry: From Pixels to Geometry-Guided Image Editing
Xinyu Pu
Hongsong Wang
Jie Gui
Pan Zhou
DiffM
166
2
0
30 Sep 2025
CO3: Contrasting Concepts Compose Better
Debottam Dutta
Jianchong Chen
Rajalaxmi Rajagopalan
Yu-Lin Wei
Romit Roy Choudhury
DiffM
136
0
0
30 Sep 2025
IMG: Calibrating Diffusion Models via Implicit Multimodal Guidance
Jiayi Guo
Chuanhao Yan
Xingqian Xu
Yulin Wang
Kai Wang
Gao Huang
Humphrey Shi
149
1
0
30 Sep 2025
Does FLUX Already Know How to Perform Physically Plausible Image Composition?
Shilin Lu
Zhuming Lian
Zihan Zhou
Shaocong Zhang
Chen Zhao
A. Kong
318
13
0
25 Sep 2025
Embodied AI: From LLMs to World Models
Tongtong Feng
Xin Wang
Yu Jiang
Wenwu Zhu
LM&Ro
364
15
0
24 Sep 2025
OverLayBench: A Benchmark for Layout-to-Image Generation with Dense Overlaps
Bingnan Li
Chen Wang
Haiyang Xu
Xiang Zhang
Ethan Armand
Divyansh Srivastava
Xiaojun Shan
Zeyuan Chen
Jianwen Xie
Zhuowen Tu
VLM
169
1
0
23 Sep 2025
Automated Prompt Generation for Creative and Counterfactual Text-to-image Synthesis
Aleksa Jelaca
Ying Jiao
Chang Tian
Marie-Francine Moens
DiffM
89
0
0
23 Sep 2025
Layout-Conditioned Autoregressive Text-to-Image Generation via Structured Masking
Zirui Zheng
Takashi Isobe
Tong Shen
Xu Jia
Jianbin Zhao
...
Dong Li
Dong Zhou
Yunzhi Zhuge
Huchuan Lu
E. Barsoum
176
1
0
15 Sep 2025
PromptEnhancer: A Simple Approach to Enhance Text-to-Image Models via Chain-of-Thought Prompt Rewriting
Linqing Wang
Ximing Xing
Yiji Cheng
Zhiyuan Zhao
Donghao Li
...
Chunyu Wang
Xinchi Deng
S. Gu
C. Wang
Qinglin Lu
408
16
0
04 Sep 2025
MEPG: Multi-Expert Planning and Generation for Compositionally-Rich Image Generation
Yuan Zhao
Lin Liu
DiffM, MoE
217
0
0
04 Sep 2025
AniME: Adaptive Multi-Agent Planning for Long Animation Generation
Lisai Zhang
Baohan Xu
Siqian Yang
Mingyu Yin
Jing Liu
...
Yidi Wu
Y. Hong
Zihao Zhang
Yanzhang Liang
Yudong Jiang
AI4CE
95
2
0
26 Aug 2025
Instant Preference Alignment for Text-to-Image Diffusion Models
Yan Zhao
Songlin Yang
Xiaoxuan Han
Wei Wang
Jing Dong
Yueming Lyu
Ziyu Xue
131
1
0
25 Aug 2025
Comp-X: On Defining an Interactive Learned Image Compression Paradigm With Expert-driven LLM Agent
Yixin Gao
Xin Li
Xiaohan Pan
Runsen Feng
Bingchen Li
Y. Qi
Y. Lu
Zhengxue Cheng
Zhibo Chen
Jörn Ostermann
151
0
0
21 Aug 2025
DegDiT: Controllable Audio Generation with Dynamic Event Graph Guided Diffusion Transformer
Yisu Liu
Chenxing Li
Wanqian Zhang
Wenfu Wang
Meng Yu
Ruibo Fu
Zheng Lin
Weiping Wang
Dong Yu
DiffM
160
0
0
19 Aug 2025
Preacher: Paper-to-Video Agentic System
Jingwei Liu
Ling Yang
Hao Luo
Fan Wang
Hongyan Li
M. Y. Wang
DiffM, VGen
550
2
0
13 Aug 2025
Detail++: Training-Free Detail Enhancer for Text-to-Image Diffusion Models
L. Chen
Jiner Wang
Zihao Pan
B. Zhu
Xiaofeng Yang
Chi Zhang
DiffM
192
2
0
23 Jul 2025
Inversion-DPO: Precise and Efficient Post-Training for Diffusion Models
Zejian Li
Yize Li
Chenye Meng
Zhongni Liu
Yang Ling
Shengyuan Zhang
Guang Yang
Changyuan Yang
Zhiyuan Yang
Lingyun Sun
396
6
0
14 Jul 2025
FreeAudio: Training-Free Timing Planning for Controllable Long-Form Text-to-Audio Generation
Yuxuan Jiang
Zehua Chen
Zeqian Ju
Chang Li
Weibei Dou
Jun Zhu
184
9
0
11 Jul 2025
PosterCraft: Rethinking High-Quality Aesthetic Poster Generation in a Unified Framework
Sixiang Chen
Jianyu Lai
Jialin Gao
Tian-Chun Ye
Haoyu Chen
...
Zhaohu Xing
Yeying Jin
Junfeng Luo
Xiaoming Wei
Lei Zhu
DiffM
287
11
0
12 Jun 2025
Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers
Zhengyao Lv
Tianlin Pan
Chenyang Si
Zhaoxi Chen
W. Zuo
Yu Qiao
Kwan-Yee K. Wong
325
7
0
09 Jun 2025
SeedEdit 3.0: Fast and High-Quality Generative Image Editing
Peng Wang
Yichun Shi
Xiaochen Lian
Zhonghua Zhai
Xin Xia
Xuefeng Xiao
Weilin Huang
Jianchao Yang
429
30
0
05 Jun 2025
ByteMorph: Benchmarking Instruction-Guided Image Editing with Non-Rigid Motions
Di Chang
Mingdeng Cao
Yichun Shi
Bo Liu
Shengqu Cai
Shijie Zhou
Weilin Huang
Gordon Wetzstein
M. Soleymani
Peng Wang
DiffM, VGen
369
7
0
03 Jun 2025
Image Generation from Contextually-Contradictory Prompts
Saar Huberman
Or Patashnik
Omer Dahary
Ron Mokady
Daniel Cohen-Or
DiffM
239
3
0
02 Jun 2025
ComposeAnything: Composite Object Priors for Text-to-Image Generation
Zeeshan Khan
Shizhe Chen
Cordelia Schmid
DiffM, CoGe
293
1
0
30 May 2025
A Survey of Generative Categories and Techniques in Multimodal Generative Models
Longzhen Han
Awes Mubarak
Almas Baimagambetov
Nikolaos Polatidis
Thar Baker
LRM
417
0
0
29 May 2025
Rhetorical Text-to-Image Generation via Two-layer Diffusion Policy Optimization
Yuxi Zhang
Yueting Li
Xinyu Du
Sibo Wang
DiffM, EGVM
265
0
0
28 May 2025
Be Decisive: Noise-Induced Layouts for Multi-Subject Generation
Omer Dahary
Yehonathan Cohen
Or Patashnik
Kfir Aberman
Daniel Cohen-Or
DiffM
341
6
0
27 May 2025
ISAC: Training-Free Instance-to-Semantic Attention Control for Improving Multi-Instance Generation
Sanghyun Jo
Wooyeol Lee
Ziseok Lee
Kyungsu Kim
1.1K
0
0
27 May 2025
Agentic 3D Scene Generation with Spatially Contextualized VLMs
Xinhang Liu
Yu-Wing Tai
Chi-Keung Tang
VGen
367
5
0
26 May 2025
Affective Image Editing: Shaping Emotional Factors via Text Descriptions
Peixuan Zhang
Shuchen Weng
Chengxuan Zhu
Binghao Tang
Zijian Jia
Si Li
Boxin Shi
DiffM
193
4
0
24 May 2025
Creatively Upscaling Images with Global-Regional Priors
International Journal of Computer Vision (IJCV), 2025
Yurui Qian
Qi Cai
Yingwei Pan
Ting Yao
Tao Mei
DiffM
399
1
0
22 May 2025
MMaDA: Multimodal Large Diffusion Language Models
Ling Yang
Ye Tian
Bowen Li
Xinchen Zhang
Ke Shen
Yunhai Tong
Mengdi Wang
VLM, LRM
515
137
0
21 May 2025
MindOmni: Unleashing Reasoning Generation in Vision Language Models with RGPO
Yicheng Xiao
Lin Song
Yukang Chen
Yingmin Luo
Yuxin Chen
Yukang Gan
Wei Huang
Xiu Li
Xiaojuan Qi
Mingyu Ding
LRM
325
20
0
19 May 2025
Exploring the Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis
Computer Vision and Pattern Recognition (CVPR), 2025
Bingda Tang
Boyang Zheng
Xichen Pan
Sayak Paul
Saining Xie
285
11
0
15 May 2025
MCCD: Multi-Agent Collaboration-based Compositional Diffusion for Complex Text-to-Image Generation
Computer Vision and Pattern Recognition (CVPR), 2025
Mingcheng Li
Xiaolu Hou
Ziyang Liu
Jinjie Wei
Ziyun Qian
Jiawei Chen
Jinjie Wei
Yiheng Jiang
Qingyao Xu
Li Zhang
DiffM
1.2K
12
0
05 May 2025
Step1X-Edit: A Practical Framework for General Image Editing
Shixuan Liu
Yucheng Han
Peng Xing
Fukun Yin
Rui Wang
...
Yibo Zhu
Binxing Jiao
Wei Wei
Gang Yu
Daxin Jiang
DiffM
776
203
0
24 Apr 2025
DyST-XL: Dynamic Layout Planning and Content Control for Compositional Text-to-Video Generation
Weijie He
Mushui Liu
YunLong Yu
Zhao Wang
Chao Wu
DiffM, VGen
387
1
0
21 Apr 2025
Hierarchical and Step-Layer-Wise Tuning of Attention Specialty for Multi-Instance Synthesis in Diffusion Transformers
Chunyang Zhang
Zhenhong Sun
Zhicheng Zhang
Junyan Wang
Yu Zhang
Dong Gong
H. Mo
Daoyi Dong
423
1
0
14 Apr 2025
Training-free Guidance in Text-to-Video Generation via Multimodal Planning and Structured Noise Initialization
Jialu Li
Shoubin Yu
Han Lin
Jaemin Cho
Jaehong Yoon
Joey Tianyi Zhou
DiffM, VGen
386
7
0
11 Apr 2025