TokensGen: Harnessing Condensed Tokens for Long Video Generation

21 July 2025

Wenqi Ouyang

Zeqi Xiao

Danni Yang

Yifan Zhou

Shuai Yang

Lei Yang

Jianlou Si

Xingang Pan



Main: 8 Pages

13 Figures

Bibliography: 3 Pages

3 Tables

Appendix: 3 Pages

Abstract

Generating consistent long videos is a complex challenge: while diffusion-based generative models generate visually impressive short clips, extending them to longer durations often leads to memory bottlenecks and long-term inconsistency. In this paper, we propose TokensGen, a novel two-stage framework that leverages condensed tokens to address these issues. Our method decomposes long video generation into three core tasks: (1) inner-clip semantic control, (2) long-term consistency control, and (3) inter-clip smooth transition. First, we train To2V (Token-to-Video), a short video diffusion model guided by text and video tokens, with a Video Tokenizer that condenses short clips into semantically rich tokens. Second, we introduce T2To (Text-to-Token), a video token diffusion transformer that generates all tokens at once, ensuring global consistency across clips. Finally, during inference, an adaptive FIFO-Diffusion strategy seamlessly connects adjacent clips, reducing boundary artifacts and enhancing smooth transitions. Experimental results demonstrate that our approach significantly enhances long-term temporal and content coherence without incurring prohibitive computational overhead. By leveraging condensed tokens and pre-trained short video models, our method provides a scalable, modular solution for long video generation, opening new possibilities for storytelling, cinematic production, and immersive simulations. Please see our project page at this https URL.
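The data flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: every function name below is hypothetical, the "models" are deterministic stand-ins that produce scalar frames, and the clip-joining step is a plain linear crossfade substituted for the adaptive FIFO-Diffusion strategy, whose details the abstract does not give. The sketch only shows the ordering: tokens for all clips are produced in one call, each clip is then rendered from its own tokens, and adjacent clips are joined over a short overlap.

```python
import random


def text_to_tokens(prompt, num_clips, tokens_per_clip=4):
    """Hypothetical stand-in for T2To: emit tokens for ALL clips in one call."""
    rng = random.Random(prompt)  # seeded by the prompt, so output is repeatable
    return [
        [rng.uniform(-1.0, 1.0) for _ in range(tokens_per_clip)]
        for _ in range(num_clips)
    ]


def tokens_to_clip(clip_tokens, frames_per_clip=16):
    """Hypothetical stand-in for To2V: render one short clip from its tokens.

    A "frame" here is a single float; a real model would output images.
    """
    base = sum(clip_tokens) / len(clip_tokens)
    return [base + 0.01 * t for t in range(frames_per_clip)]


def crossfade(clips, overlap=4):
    """Join clips by linearly blending `overlap` frames at each boundary.

    Assumption: this replaces the paper's adaptive FIFO-Diffusion step,
    which is not specified in the abstract.
    """
    video = list(clips[0])
    for clip in clips[1:]:
        for i in range(overlap):
            w = (i + 1) / (overlap + 1)  # weight of the incoming clip
            idx = len(video) - overlap + i
            video[idx] = (1 - w) * video[idx] + w * clip[i]
        video.extend(clip[overlap:])
    return video


def generate_long_video(prompt, num_clips=3, frames_per_clip=16, overlap=4):
    tokens = text_to_tokens(prompt, num_clips)            # global plan first
    clips = [tokens_to_clip(t, frames_per_clip) for t in tokens]
    return crossfade(clips, overlap)                      # smooth the seams


if __name__ == "__main__":
    video = generate_long_video("a boat drifting down a river")
    print(len(video), "frames")
```

With the default settings, three 16-frame clips joined over 4-frame overlaps yield 3 × 16 − 2 × 4 = 40 frames. Producing all tokens before any clip is rendered is what lets a global plan constrain each clip, which is the role the abstract assigns to T2To.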
