
Scalable Diffusion Models with Transformers

IEEE International Conference on Computer Vision (ICCV), 2023

19 December 2022

William S. Peebles

Saining Xie

GNN


Papers citing "Scalable Diffusion Models with Transformers"

50 / 2,712 papers shown

FlexiQ: Adaptive Mixed-Precision Quantization for Latency/Accuracy Trade-Offs in Deep Neural Networks

146

03 Oct 2025

SALSA-V: Shortcut-Augmented Long-form Synchronized Audio from Videos

122

03 Oct 2025

Paris: A Decentralized Trained Open-Weight Diffusion Model

03 Oct 2025

Taming Text-to-Sounding Video Generation via Advanced Modality Condition and Interaction

267

03 Oct 2025

Best-of-Majority: Minimax-Optimal Strategy for Pass@k

113

03 Oct 2025

When and Where do Events Switch in Multi-Event Video Generation?

213

03 Oct 2025

Growing Visual Generative Capacity for Pre-Trained MLLMs

204

02 Oct 2025

Diffusion Transformers for Imputation: Statistical Efficiency and Uncertainty Quantification

Zeqi Ye

Minshuo Chen

152

02 Oct 2025

Equilibrium Matching: Generative Modeling with Implicit Energy-Based Models

Runqian Wang

Yilun Du

DiffM

896

02 Oct 2025

Contrastive Representation Regularization for Vision-Language-Action Models

232

02 Oct 2025

Fine-Grained GRPO for Precise Preference Alignment in Flow Models

234

02 Oct 2025

Optimal Control Meets Flow Matching: A Principled Route to Multi-Subject Fidelity

353

02 Oct 2025

Input-Aware Sparse Attention for Real-Time Co-Speech Video Generation

327

02 Oct 2025

Zero-shot Human Pose Estimation using Diffusion-based Inverse solvers

Sahil Bhandary Karnoor

Romit Roy Choudhury

DiffM

150

02 Oct 2025

FreeViS: Training-free Video Stylization with Inconsistent References

208

02 Oct 2025

Learning to Generate Rigid Body Interactions with Video Diffusion Models

458

02 Oct 2025

Pack and Force Your Memory: Long-form and Consistent Video Generation

244

02 Oct 2025

DragFlow: Unleashing DiT Priors with Region Based Supervision for Drag Editing

313

02 Oct 2025

Self-Forcing++: Towards Minute-Scale High-Quality Video Generation

231

02 Oct 2025

NoiseShift: Resolution-Aware Noise Recalibration for Better Low-Resolution Image Generation

155

02 Oct 2025

InfVSR: Breaking Length Limits of Generic Video Super-Resolution

165

01 Oct 2025

Image Generation Based on Image Style Extraction

Shuochen Chang

128

01 Oct 2025

LVTINO: LAtent Video consisTency INverse sOlver for High Definition Video Restoration

189

01 Oct 2025

Fine-Tuning Masked Diffusion for Provable Self-Correction

304

01 Oct 2025

Cascaded Diffusion Framework for Probabilistic Coarse-to-Fine Hand Pose Estimation

149

01 Oct 2025

Learn to Guide Your Diffusion Model

448

01 Oct 2025

Purrception: Variational Flow Matching for Vector-Quantized Image Generation

Răzvan-Andrei Matişan

Jan-Willem van de Meent

Mohammad Mahdi Derakhshani

Floor Eijkelboom

147

01 Oct 2025

IMAGEdit: Let Any Subject Transform

122

01 Oct 2025

Arbitrary Generative Video Interpolation

157

01 Oct 2025

Continuously Augmented Discrete Diffusion model for Categorical Generative Modeling

280

01 Oct 2025

Syntax-Guided Diffusion Language Models with User-Integrated Personalization

130

01 Oct 2025

BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration

136

01 Oct 2025

Selective Underfitting in Diffusion Models

144

01 Oct 2025

VLA-RFT: Vision-Language-Action Reinforcement Fine-tuning with Verified Rewards in World Simulators

...

149

01 Oct 2025

PatchVSR: Breaking Video Diffusion Resolution Limits with Patch-wise Video Super-Resolution

Computer Vision and Pattern Recognition (CVPR), 2025

299

30 Sep 2025

Query-Kontext: An Unified Multimodal Model for Image Generation and Editing

...

153

30 Sep 2025

LieHMR: Autoregressive Human Mesh Recovery with SO(3)

212

30 Sep 2025

Refine Drugs, Don't Complete Them: Uniform-Source Discrete Flows for Fragment-Based Drug Discovery

30 Sep 2025

Post-Training Quantization for Audio Diffusion Transformers

Tanmay Khandelwal

Magdalena Fuentes

117

30 Sep 2025

OmniNav: A Unified Framework for Prospective Exploration and Visual-Language Navigation

298

30 Sep 2025

AReUReDi: Annealed Rectified Updates for Refining Discrete Flows with Multi-Objective Guidance

Tong Chen

Yinuo Zhang

Pranam Chatterjee

170

30 Sep 2025

LTA-L2S: Lexical Tone-Aware Lip-to-Speech Synthesis for Mandarin with Cross-Lingual Transfer Learning

104

30 Sep 2025

DA$^{2}$: Depth Anything in Any Direction

479

30 Sep 2025

AdaBlock-dLLM: Semantic-Aware Diffusion LLM Inference via Adaptive Block Size

123

30 Sep 2025

Flow Autoencoders are Effective Protein Tokenizers

125

30 Sep 2025

MotionRAG: Motion Retrieval-Augmented Image-to-Video Generation

125

30 Sep 2025

Stitch: Training-Free Position Control in Multimodal Diffusion Transformers

155

30 Sep 2025

Video Object Segmentation-Aware Audio Generation

184

30 Sep 2025

Post-Training Quantization via Residual Truncation and Zero Suppression for Diffusion Models

152

30 Sep 2025

Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation

156

30 Sep 2025