arXiv 2212.09748 (v2, latest)
Scalable Diffusion Models with Transformers
IEEE International Conference on Computer Vision (ICCV), 2023
19 December 2022
William S. Peebles
Saining Xie
GNN
Papers citing "Scalable Diffusion Models with Transformers"
50 / 2,688 papers shown
LAMIC: Layout-Aware Multi-Image Composition via Scalability of Multimodal Diffusion Transformer
Yuzhuo Chen
Zehua Ma
Jianhua Wang
Kai Kang
Shunyu Yao
Weiming Zhang
VLM
105
2
0
24 Dec 2025
Video2Act: A Dual-System Video Diffusion Policy with Robotic Spatio-Motional Modeling
Yueru Jia
Jiaming Liu
Shengbang Liu
Rui Zhou
W. Yu
Yuyang Yan
Xiaowei Chi
Yandong Guo
Boxin Shi
Shanghang Zhang
VGen
192
1
0
02 Dec 2025
MAViD: A Multimodal Framework for Audio-Visual Dialogue Understanding and Generation
Youxin Pang
Jiajun Liu
L. Tan
Yong Zhang
Feng Gao
Xiang Deng
Zhuoliang Kang
Xiaoming Wei
Y. Liu
VGen
60
0
0
02 Dec 2025
MultiShotMaster: A Controllable Multi-Shot Video Generation Framework
Qinghe Wang
Xiaoyu Shi
Baolu Li
Weikang Bian
Quande Liu
Huchuan Lu
Xintao Wang
Pengfei Wan
Kun Gai
Xu Jia
VGen
142
1
0
02 Dec 2025
YingVideo-MV: Music-Driven Multi-Stage Video Generation
Jiahui Chen
Weida Wang
Runhua Shi
Huan Yang
Chaofan Ding
Zihao Chen
DiffM
VGen
85
0
0
02 Dec 2025
Taming Camera-Controlled Video Generation with Verifiable Geometry Reward
Zhaoqing Wang
Xiaobo Xia
Zhuolin Bie
Jinlin Liu
Dongdong Yu
Jia-Wang Bian
Changhu Wang
EGVM
VGen
117
0
0
02 Dec 2025
FRAMER: Frequency-Aligned Self-Distillation with Adaptive Modulation Leveraging Diffusion Priors for Real-World Image Super-Resolution
Seungho Choi
Jeahun Sung
Jihyong Oh
DiffM
78
0
0
01 Dec 2025
Generative Editing in the Joint Vision-Language Space for Zero-Shot Composed Image Retrieval
Xin Wang
H. Zhang
Mang Li
Zhaohui Xia
Y. Chen
Yu Zhang
Chunyu Wei
DiffM
73
0
0
01 Dec 2025
DreamingComics: A Story Visualization Pipeline via Subject and Layout Customized Generation using Video Models
Patrick Kwon
Chen Chen
DiffM
AI4TS
VGen
84
0
0
01 Dec 2025
TokenPure: Watermark Removal through Tokenized Appearance and Structural Guidance
Pei Yang
Y. Liu
Kelly Peng
Yuan Gao
Yiren Song
WIGM
129
0
0
01 Dec 2025
Open-world Hand-Object Interaction Video Generation Based on Structure and Contact-aware Representation
Haodong Yan
Hang Yu
Zhide Zhong
Weilin Yuan
Xin Gong
...
Chengxi Heyu
Junfeng Li
Wenxuan Song
Shunbo Zhou
Haoang Li
8
0
0
01 Dec 2025
SpriteHand: Real-Time Versatile Hand-Object Interaction with Autoregressive Video Generation
Zisu Li
Hengye Lyu
Jiaxin Shi
Yufeng Zeng
Mingming Fan
Hanwang Zhang
Chen Liang
VGen
96
0
0
01 Dec 2025
Modality-Augmented Fine-Tuning of Foundation Robot Policies for Cross-Embodiment Manipulation on GR1 and G1
Junsung Park
Hogun Kee
Songhwai Oh
32
0
0
01 Dec 2025
ResDiT: Evoking the Intrinsic Resolution Scalability in Diffusion Transformers
Yiyang Ma
Feng Zhou
Xuedan Yin
Pu Cao
Yonghao Dang
Jianqin Yin
12
0
0
01 Dec 2025
Improved Mean Flows: On the Challenges of Fastforward Generative Models
Zhengyang Geng
Yiyang Lu
Zongze Wu
Eli Shechtman
J. Zico Kolter
Kaiming He
AI4CE
28
1
0
01 Dec 2025
Efficient Training of Diffusion Mixture-of-Experts Models: A Practical Recipe
Y. Liu
Yang Yue
Jingyuan Zhang
Chenxi Sun
Yang Zhou
Wencong Zeng
Ruiming Tang
Guorui Zhou
DiffM
MoE
68
0
0
01 Dec 2025
Reversible Inversion for Training-Free Exemplar-guided Image Editing
Yuke Li
Lianli Gao
Ji Zhang
Pengpeng Zeng
Lichuan Xiang
Hongkai Wen
Heng Tao Shen
Jingkuan Song
DiffM
64
0
0
01 Dec 2025
TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models
Zhiheng Liu
Weiming Ren
Haozhe Liu
Zijian Zhou
S. Chen
...
Ping Luo
Wei Liu
Tao Xiang
Jonas Schult
Yuren Cong
80
0
0
01 Dec 2025
ViT³: Unlocking Test-Time Training in Vision
Dongchen Han
Y. Li
Tianyu Li
Z. Cao
Ziming Wang
Jun Song
Yu Cheng
Bo Zheng
Gao Huang
ViT
20
0
0
01 Dec 2025
TrajDiff: End-to-end Autonomous Driving without Perception Annotation
Xingtai Gui
Jianbo Zhao
Wencheng Han
Jikai Wang
Jiahao Gong
Feiyang Tan
Cheng-Zhong Xu
Jianbing Shen
8
1
0
30 Nov 2025
CycleManip: Enabling Cyclic Task Manipulation via Effective Historical Perception and Understanding
Yi-Lin Wei
Haoran Liao
Yuhao Lin
Pengyue Wang
Zhizhao Liang
Guiliang Liu
Wei-Shi Zheng
16
0
0
30 Nov 2025
Silhouette-based Gait Foundation Model
Dingqiang Ye
Chao Fan
Kartik Narayan
Bingzhe Wu
Chengwen Luo
Jianqiang Li
Vishal M. Patel
16
0
0
30 Nov 2025
Dynamic-eDiTor: Training-Free Text-Driven 4D Scene Editing with Multimodal Diffusion Transformer
Dong In Lee
Hyungjun Doh
Seunggeun Chi
Runlin Duan
Sangpil Kim
K. Ramani
DiffM
3DGS
VGen
100
0
0
30 Nov 2025
Audio-Visual World Models: Towards Multisensory Imagination in Sight and Sound
Jiahua Wang
Shannan Yan
Leqi Zheng
Jialong Wu
Yaoxin Mao
VGen
32
0
0
30 Nov 2025
Image Generation as a Visual Planner for Robotic Manipulation
Ye Pang
VGen
34
0
0
29 Nov 2025
LAP: Fast LAtent Diffusion Planner with Fine-Grained Feature Distillation for Autonomous Driving
Jinhao Zhang
Wenlong Xia
Zhexuan Zhou
Youmin Gong
Jie Mei
28
0
0
29 Nov 2025
PhysGen: Physically Grounded 3D Shape Generation for Industrial Design
Yingxuan You
Chen Zhao
Hantao Zhang
Mingda Xu
Pascal Fua
AI4CE
17
0
0
29 Nov 2025
CC-FMO: Camera-Conditioned Zero-Shot Single Image to 3D Scene Generation with Foundation Model Orchestration
Boshi Tang
Henry Zheng
Rui Huang
Gao Huang
VGen
112
0
0
29 Nov 2025
What about gravity in video generation? Post-Training Newton's Laws with Verifiable Rewards
Minh-Quan Le
Yuanzhi Zhu
Vicky Kalogeiton
Dimitris Samaras
EGVM
VGen
71
0
0
29 Nov 2025
UniDiff: Parameter-Efficient Adaptation of Diffusion Models for Land Cover Classification with Multi-Modal Remotely Sensed Imagery and Sparse Annotations
Yuzhen Hu
Saurabh Prasad
24
0
0
29 Nov 2025
Optimizing Distributional Geometry Alignment with Optimal Transport for Generative Dataset Distillation
Xiao Cui
Yulei Qin
Wengang Zhou
Hongsheng Li
Houqiang Li
DD
OT
140
1
0
29 Nov 2025
VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction
Sinan Du
Jiahao Guo
Bo Li
Shuhao Cui
Zhengzhuo Xu
...
Yongxian Wei
Kun Gai
X. Wang
Kai Wu
C. Yuan
114
0
0
28 Nov 2025
DisMo: Disentangled Motion Representations for Open-World Motion Transfer
Thomas Ressler-Antal
Frank Fundel
Malek Ben Alaya
S. A. Baumann
Felix Krause
Ming Gui
Bjorn Ommer
DiffM
VGen
33
0
0
28 Nov 2025
AnyTalker: Scaling Multi-Person Talking Video Generation with Interactivity Refinement
Zhizhou Zhong
Yicheng Ji
Zhe Kong
Y. Liu
Jiarui Wang
...
Ying Qin
Huan Li
Shuiyang Mao
W. Liu
Wenhan Luo
DiffM
VGen
60
0
0
28 Nov 2025
McSc: Motion-Corrective Preference Alignment for Video Generation with Self-Critic Hierarchical Reasoning
Q. Yang
Yingjie Chen
Yuan Yao
Yifang Men
Huaizhuo Liu
Miaomiao Cui
EGVM
VGen
190
0
0
28 Nov 2025
GOATex: Geometry & Occlusion-Aware Texturing
Hyunjin Kim
Kunho Kim
Adam Lee
Wonkwang Lee
DiffM
32
0
0
28 Nov 2025
Flow Straighter and Faster: Efficient One-Step Generative Modeling via MeanFlow on Rectified Trajectories
Xinxi Zhang
Shiwei Tan
Quang Nguyen
Quan Dao
Ligong Han
Xiaoxiao He
Tunyu Zhang
Alen Mrdovic
Dimitris N. Metaxas
188
0
0
28 Nov 2025
Vision Bridge Transformer at Scale
Zhenxiong Tan
Zeqing Wang
Xingyi Yang
Songhua Liu
Xinchao Wang
DiffM
36
0
0
28 Nov 2025
One-to-All Animation: Alignment-Free Character Animation and Image Pose Transfer
S. Shi
Jing Xu
Zhihang Li
Chunli Peng
Xiaoda Yang
Lijing Lu
Kai Hu
Jiangning Zhang
DiffM
32
0
0
28 Nov 2025
Scalable Diffusion Transformer for Conditional 4D fMRI Synthesis
Jungwoo Seo
David K. Park
Shinjae Yoo
Jiook Cha
MedIm
205
0
0
28 Nov 2025
Guiding Visual Autoregressive Models through Spectrum Weakening
Chaoyang Wang
Tianmeng Yang
Jingdong Wang
Yunhai Tong
DiffM
76
0
0
28 Nov 2025
DualCamCtrl: Dual-Branch Diffusion Model for Geometry-Aware Camera-Controlled Video Generation
Hongfei Zhang
Kanghao Chen
Zixin Zhang
Harold Haodong Chen
Yuanhuiyi Lyu
Yuqi Zhang
Shuai Yang
Kun Zhou
Yingcong Chen
DiffM
VGen
84
1
0
28 Nov 2025
InstanceV: Instance-Level Video Generation
Yuheng Chen
Teng Hu
Jiangning Zhang
Zhucun Xue
Ran Yi
Lizhuang Ma
DiffM
VGen
64
0
0
28 Nov 2025
BlockVid: Block Diffusion for High-Quality and Consistent Minute-Long Video Generation
Zeyu Zhang
Shuning Chang
Yuanyu He
Yizeng Han
Jiasheng Tang
Fan Wang
Bohan Zhuang
DiffM
VGen
56
2
0
28 Nov 2025
db-SP: Accelerating Sparse Attention for Visual Generative Models with Dual-Balanced Sequence Parallelism
Siqi Chen
Ke Hong
Tianchen Zhao
Ruiqi Xie
Zhenhua Zhu
X. Zhang
Yu Wang
MoE
48
0
0
28 Nov 2025
IMTalker: Efficient Audio-driven Talking Face Generation with Implicit Motion Transfer
Bo Chen
Tao Liu
Qi Chen
Xie Chen
Zilong Zheng
VGen
40
0
0
27 Nov 2025
TTSnap: Test-Time Scaling of Diffusion Models via Noise-Aware Pruning
Qingtao Yu
Changlin Song
Minghao Sun
Zhengyang Yu
Vinay Kumar Verma
Soumya Roy
Sumit Negi
Hongdong Li
Dylan Campbell
28
0
0
27 Nov 2025
ITS3D: Inference-Time Scaling for Text-Guided 3D Diffusion Models
Zhenglin Zhou
Fan Ma
Xiaobo Xia
Hehe Fan
Yi Yang
Tat-Seng Chua
DiffM
3DGS
69
0
0
27 Nov 2025
Toward Diffusible High-Dimensional Latent Spaces: A Frequency Perspective
Bolin Lai
Xudong Wang
Saketh Rambhatla
James M. Rehg
Zsolt Kira
Rohit Girdhar
Ishan Misra
DiffM
68
0
0
27 Nov 2025
StreamFlow: Theory, Algorithm, and Implementation for High-Efficiency Rectified Flow Generation
Sen Fang
Hongbin Zhong
Yalin Feng
Dimitris N. Metaxas
48
1
0
27 Nov 2025