ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2503.17539
59
0

Generating, Fast and Slow: Scalable Parallel Video Generation with Video Interface Networks

21 March 2025
Bhishma Dedhia
David Bourgin
Krishna Kumar Singh
Yuheng Li
Yan Kang
Zhan Xu
N. Jha
Y. Liu
    DiffM
    VGen
ArXivPDFHTML
Abstract

Diffusion Transformers (DiTs) can generate short photorealistic videos, yet directly training and sampling longer videos with full attention across the video remains computationally challenging. Alternative methods break long videos down into sequential generation of short video segments, requiring multiple sampling chain iterations and specialized consistency modules. To overcome these challenges, we introduce a new paradigm called Video Interface Networks (VINs), which augment DiTs with an abstraction module to enable parallel inference of video chunks. At each diffusion step, VINs encode global semantics from the noisy input of local chunks and the encoded representations, in turn, guide DiTs in denoising chunks in parallel. The coupling of VIN and DiT is learned end-to-end on the denoising objective. Further, the VIN architecture maintains fixed-size encoding tokens that encode the input via a single cross-attention step. Disentangling the encoding tokens from the input thus enables VIN to scale to long videos and learn essential semantics. Experiments on VBench demonstrate that VINs surpass existing chunk-based methods in preserving background consistency and subject coherence. We then show via an optical flow analysis that our approach attains state-of-the-art motion smoothness while using 25-40% fewer FLOPs than full generation. Finally, human raters favorably assessed the overall video quality and temporal consistency of our method in a user study.

View on arXiv
@article{dedhia2025_2503.17539,
  title={ Generating, Fast and Slow: Scalable Parallel Video Generation with Video Interface Networks },
  author={ Bhishma Dedhia and David Bourgin and Krishna Kumar Singh and Yuheng Li and Yan Kang and Zhan Xu and Niraj K. Jha and Yuchen Liu },
  journal={arXiv preprint arXiv:2503.17539},
  year={ 2025 }
}
Comments on this paper