Scalable Diffusion Models with Transformers

IEEE International Conference on Computer Vision (ICCV), 2023

19 December 2022

William S. Peebles

Saining Xie

GNN

Papers citing "Scalable Diffusion Models with Transformers"

50 / 2,712 papers shown

Vision-Language-Action Models for Robotics: A Review Towards Real-World Applications

IEEE Access, 2025

277

08 Oct 2025

Revisiting Mixout: An Overlooked Path to Robust Finetuning

245

08 Oct 2025

DynamicEval: Rethinking Evaluation for Dynamic Text-to-Video Synthesis

200

08 Oct 2025

MATRIX: Mask Track Alignment for Interaction-aware Video Generation

106

08 Oct 2025

scPPDM: A Diffusion Model for Single-Cell Drug-Response Prediction

08 Oct 2025

Heptapod: Language Modeling on Visual Signals

162

08 Oct 2025

DreamOmni2: Multimodal Instruction-based Editing and Generation

...

118

08 Oct 2025

Generative World Modelling for Humanoids: 1X World Model Challenge Technical Report

143

08 Oct 2025

WristWorld: Generating Wrist-Views via 4D World Models for Robotic Manipulation

135

08 Oct 2025

GyroSwin: 5D Surrogates for Gyrokinetic Plasma Turbulence Simulations

Johannes Brandstetter

213

08 Oct 2025

$\mathbf{D}^3$QE: Learning Discrete Distribution Discrepancy-aware Quantization Error for Autoregressive-Generated Image Detection

193

07 Oct 2025

Efficient High-Resolution Image Editing with Hallucination-Aware Loss and Adaptive Tiling

Young D. Kwon

Abhinav Mehrotra

Malcolm Chadwick

Alberto Gil C. P. Ramos

S. Bhattacharya

DiffM

168

07 Oct 2025

Mitigating Surgical Data Imbalance with Dual-Prediction Video Diffusion Model

Danush Kumar Venkatesh

Adam Schmidt

Muhammad Abdullah Jamal

Omid Mohareri

VGen MedIm

144

07 Oct 2025

VCoT-Grasp: Grasp Foundation Models with Visual Chain-of-Thought Reasoning for Language-driven Grasp Generation

200

07 Oct 2025

Deforming Videos to Masks: Flow Matching for Referring Video Segmentation

229

07 Oct 2025

SIGMA-GEN: Structure and Identity Guided Multi-subject Assembly for Image Generation

Kevin Blackburn-Matzen

Matheus Gadelha

127

07 Oct 2025

Parallel Tokenizers: Rethinking Vocabulary Design for Cross-Lingual Transfer

Muhammad Dehan Al Kautsar

Fajri Koto

199

07 Oct 2025

Drive&Gen: Co-Evaluating End-to-End Driving and Video Generation Models

...

101

07 Oct 2025

Riddled basin geometry sets fundamental limits to predictability and reproducibility in deep learning

Andrew Ly

Pulin Gong

AI4CE

187

07 Oct 2025

Luth: Efficient French Specialization for Small Language Models and Cross-Lingual Transfer

Maxence Lasbordes

Sinoué Gad

134

07 Oct 2025

LightCache: Memory-Efficient, Training-Free Acceleration for Video Generation

148

06 Oct 2025

StaMo: Unsupervised Learning of Generalizable Robot Motion from Compact State Representation

117

06 Oct 2025

REAR: Rethinking Visual Autoregressive Models via Generator-Tokenizer Consistency Regularization

232

06 Oct 2025

Asynchronous Denoising Diffusion Models for Aligning Text-to-Image Generation

193

06 Oct 2025

TBStar-Edit: From Image Editing Pattern Shifting to Consistency Enhancement

341

06 Oct 2025

SSDD: Single-Step Diffusion Decoder for Efficient Image Tokenization

236

06 Oct 2025

Pulp Motion: Framing-aware multimodal camera and human motion generation

196

06 Oct 2025

Bidirectional Mammogram View Translation with Column-Aware and Implicit 3D Conditional Diffusion

191

06 Oct 2025

Factuality Matters: When Image Generation and Editing Meet Structured Visuals

...

245

06 Oct 2025

Scaling Sequence-to-Sequence Generative Neural Rendering

...

Juan-Manuel Perez-Rua

VGen

129

05 Oct 2025

MASC: Boosting Autoregressive Image Generation with a Manifold-Aligned Semantic Clustering

Lixuan He

Shikang Zheng

Linfeng Zhang

159

05 Oct 2025

Let Features Decide Their Own Solvers: Hybrid Feature Caching for Diffusion Transformers

153

05 Oct 2025

Principled and Tractable RL for Reasoning with Diffusion Language Models

Anthony Zhan

DiffM AI4CE

114

05 Oct 2025

ContextVLA: Vision-Language-Action Model with Amortized Multi-Frame Context

137

05 Oct 2025

FoilDiff: A Hybrid Transformer Backbone for Diffusion-based Modelling of 2D Airfoil Flow Fields

539

05 Oct 2025

Drax: Speech Recognition with Discrete Flow Matching

130

05 Oct 2025

MorphoSim: An Interactive, Controllable, and Editable Language-guided 4D World Simulator

Xuehai He

Shijie Zhou

Thivyanth Venkateswaran

163

05 Oct 2025

Proximal Diffusion Neural Sampler

167

04 Oct 2025

Rainbow Padding: Mitigating Early Termination in Instruction-Tuned Diffusion LLMs

145

04 Oct 2025

Neon: Negative Extrapolation From Self-Training Improves Image Generation

308

04 Oct 2025

Generating Human Motion Videos using a Cascaded Text-to-Video Framework

127

04 Oct 2025

Streaming Drag-Oriented Interactive Video Manipulation: Drag Anything, Anytime!

235

03 Oct 2025

What Drives Compositional Generalization in Visual Generative Models?

325

03 Oct 2025

Taming Text-to-Sounding Video Generation via Advanced Modality Condition and Interaction

267

03 Oct 2025

Memory Forcing: Spatio-Temporal Memory for Consistent Scene Generation on Minecraft

193

03 Oct 2025

SALSA-V: Shortcut-Augmented Long-form Synchronized Audio from Videos

121

03 Oct 2025

FlexiQ: Adaptive Mixed-Precision Quantization for Latency/Accuracy Trade-Offs in Deep Neural Networks

146

03 Oct 2025

When and Where do Events Switch in Multi-Event Video Generation?

213

03 Oct 2025

Best-of-Majority: Minimax-Optimal Strategy for Pass@$k$

113

03 Oct 2025

Coevolutionary Continuous Discrete Diffusion: Make Your Diffusion Language Model a Latent Reasoner

169

03 Oct 2025