ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2505.12489
  4. Cited By
Video-GPT via Next Clip Diffusion

Video-GPT via Next Clip Diffusion

18 May 2025
Shaobin Zhuang
Zhipeng Huang
Ying Zhang
Fangyikang Wang
Canmiao Fu
Binxin Yang
Chong Sun
Chen Li
Yali Wang
    DiffM
    VGen
ArXivPDFHTML

Papers citing "Video-GPT via Next Clip Diffusion"

50 / 71 papers shown
Title
Long-Context Autoregressive Video Modeling with Next-Frame Prediction
Long-Context Autoregressive Video Modeling with Next-Frame Prediction
Yuchao Gu
Weijia Mao
Mike Zheng Shou
VGen
104
3
0
25 Mar 2025
V-Stylist: Video Stylization via Collaboration and Reflection of MLLM Agents
V-Stylist: Video Stylization via Collaboration and Reflection of MLLM Agents
Zhengrong Yue
Shaobin Zhuang
Kunchang Li
Yanbo Ding
Yansen Wang
VGen
DiffM
43
1
0
15 Mar 2025
Get In Video: Add Anything You Want to the Video
Shaobin Zhuang
Zhipeng Huang
Binxin Yang
Ying Zhang
Fangyikang Wang
Canmiao Fu
Chong Sun
Zheng-Jun Zha
Chen Li
Yijiao Wang
DiffM
VGen
65
2
0
08 Mar 2025
WeGen: A Unified Model for Interactive Multimodal Generation as We Chat
Zhipeng Huang
Shaobin Zhuang
Canmiao Fu
Binxin Yang
Ying Zhang
Chong Sun
Zhizheng Zhang
Yali Wang
Chen Li
Zheng-Jun Zha
DiffM
84
3
0
03 Mar 2025
Taming Teacher Forcing for Masked Autoregressive Video Generation
Taming Teacher Forcing for Masked Autoregressive Video Generation
Deyu Zhou
Quan Sun
Yuang Peng
Kun Yan
Runpei Dong
...
Zheng Ge
Nan Duan
Xiangyu Zhang
L. Ni
H. Shum
VGen
66
7
0
21 Jan 2025
Causal Diffusion Transformers for Generative Modeling
Causal Diffusion Transformers for Generative Modeling
Chaorui Deng
Deyao Zhu
Kunchang Li
Shi Guang
Haoqi Fan
DiffM
78
7
0
16 Dec 2024
From Slow Bidirectional to Fast Autoregressive Video Diffusion Models
From Slow Bidirectional to Fast Autoregressive Video Diffusion Models
Tianwei Yin
Qiang Zhang
Richard Zhang
William T. Freeman
F. Durand
Eli Shechtman
Xun Huang
VGen
DiffM
118
11
0
10 Dec 2024
Open-Sora Plan: Open-Source Large Video Generation Model
Bin Lin
Yunyang Ge
Xinhua Cheng
Zongjian Li
Bin Zhu
...
Zhang Pan
Xing Zhou
Shaoling Dong
Yonghong Tian
Li-xin Yuan
VLM
VGen
139
68
0
28 Nov 2024
LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior
LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior
Hanyu Wang
Saksham Suri
Yixuan Ren
Hao Chen
Abhinav Shrivastava
VGen
61
10
0
28 Oct 2024
Koala-36M: A Large-scale Video Dataset Improving Consistency between Fine-grained Conditions and Video Content
Koala-36M: A Large-scale Video Dataset Improving Consistency between Fine-grained Conditions and Video Content
Qiuheng Wang
Yukai Shi
Jiarong Ou
Ruoxin Chen
Ke Lin
...
Mingwu Zheng
Xin Tao
Fei Yang
Pengfei Wan
Di Zhang
VGen
111
23
0
10 Oct 2024
Emu3: Next-Token Prediction is All You Need
Emu3: Next-Token Prediction is All You Need
Xinlong Wang
Xiaosong Zhang
Zhengxiong Luo
Quan-Sen Sun
Yufeng Cui
...
Xi Yang
Jingjing Liu
Yonghua Lin
Tiejun Huang
Zhongyuan Wang
MLLM
65
183
0
27 Sep 2024
Transfusion: Predict the Next Token and Diffuse Images with One
  Multi-Modal Model
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
Chunting Zhou
Lili Yu
Arun Babu
Kushal Tirumala
Michihiro Yasunaga
Leonid Shamis
Jacob Kahn
Xuezhe Ma
Luke Zettlemoyer
Omer Levy
DiffM
72
165
0
20 Aug 2024
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
Zhuoyi Yang
Jiayan Teng
Wendi Zheng
Ming Ding
Shiyu Huang
...
Weihan Wang
Yean Cheng
Xiaotao Gu
Yuxiao Dong
Jie Tang
DiffM
VGen
117
453
0
12 Aug 2024
OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation
OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation
Kepan Nan
Rui Xie
Penghao Zhou
Tiehan Fan
Zhenheng Yang
Zhijie Chen
Xiang Li
Jian Yang
Ying Tai
111
76
0
02 Jul 2024
Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion
Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion
Boyuan Chen
Diego Marti Monso
Yilun Du
Max Simchowitz
Russ Tedrake
Vincent Sitzmann
DiffM
56
85
0
01 Jul 2024
OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation
OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation
Junke Wang
Yi Jiang
Zehuan Yuan
Binyue Peng
Zuxuan Wu
Yu-Gang Jiang
ViT
VGen
90
39
0
13 Jun 2024
Autoregressive Model Beats Diffusion: Llama for Scalable Image
  Generation
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
Peize Sun
Yi Jiang
Shoufa Chen
Shilong Zhang
Bingyue Peng
Ping Luo
Zehuan Yuan
VLM
91
253
0
10 Jun 2024
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Chameleon Team
MLLM
113
290
0
16 May 2024
FlexiFilm: Long Video Generation with Flexible Conditions
FlexiFilm: Long Video Generation with Flexible Conditions
Yichen Ouyang
Jianhao Yuan
Hao Zhao
Gaoang Wang
Bo Zhao
DiffM
44
7
0
29 Apr 2024
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your
  Phone
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
Marah Abdin
Sam Ade Jacobs
A. A. Awan
J. Aneja
Ahmed Hassan Awadallah
...
Li Zhang
Yi Zhang
Yue Zhang
Yunan Zhang
Xiren Zhou
LRM
ALM
77
1,136
0
22 Apr 2024
A Survey on Long Video Generation: Challenges, Methods, and Prospects
A Survey on Long Video Generation: Challenges, Methods, and Prospects
Chengxuan Li
Di Huang
Zeyu Lu
Yang Xiao
Qingqi Pei
Lei Bai
EGVM
47
22
0
25 Mar 2024
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
Patrick Esser
Sumith Kulal
A. Blattmann
Rahim Entezari
Jonas Muller
...
Zion English
Kyle Lacey
Alex Goodwin
Yannik Marek
Robin Rombach
DiffM
203
1,187
0
05 Mar 2024
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
Tsai-Shien Chen
Aliaksandr Siarohin
Willi Menapace
Ekaterina Deyneka
Hsiang-wei Chao
...
Yuwei Fang
Hsin-Ying Lee
Jian Ren
Ming-Hsuan Yang
Sergey Tulyakov
VGen
108
188
0
29 Feb 2024
Video as the New Language for Real-World Decision Making
Video as the New Language for Real-World Decision Making
Sherry Yang
Jacob Walker
Jack Parker-Holder
Yilun Du
Jake Bruce
Andre Barreto
Pieter Abbeel
Dale Schuurmans
VGen
62
48
0
27 Feb 2024
Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video
  Synthesis
Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis
Willi Menapace
Aliaksandr Siarohin
Ivan Skorokhodov
Ekaterina Deyneka
Tsai-Shien Chen
...
Yuwei Fang
A. Stoliar
Elisa Ricci
Jian Ren
Sergey Tulyakov
VGen
79
57
0
22 Feb 2024
World Model on Million-Length Video And Language With Blockwise RingAttention
World Model on Million-Length Video And Language With Blockwise RingAttention
Hao Liu
Wilson Yan
Matei A. Zaharia
Pieter Abbeel
VGen
76
71
0
13 Feb 2024
Lumiere: A Space-Time Diffusion Model for Video Generation
Lumiere: A Space-Time Diffusion Model for Video Generation
Omer Bar-Tal
Hila Chefer
Omer Tov
Charles Herrmann
Roni Paiss
...
T. Michaeli
Oliver Wang
Deqing Sun
Tali Dekel
Inbar Mosseri
VGen
139
233
0
23 Jan 2024
Vlogger: Make Your Dream A Vlog
Vlogger: Make Your Dream A Vlog
Shaobin Zhuang
Kunchang Li
Xinyuan Chen
Yaohui Wang
Ziwei Liu
Yu Qiao
Yali Wang
VGen
DiffM
54
36
0
17 Jan 2024
Latte: Latent Diffusion Transformer for Video Generation
Latte: Latent Diffusion Transformer for Video Generation
Xin Ma
Yaohui Wang
Gengyun Jia
Xinyuan Chen
Ziqiang Liu
Yuan-Fang Li
Cunjian Chen
Yu Qiao
DiffM
VGen
154
252
0
05 Jan 2024
VideoPoet: A Large Language Model for Zero-Shot Video Generation
VideoPoet: A Large Language Model for Zero-Shot Video Generation
Dan Kondratyuk
Lijun Yu
Xiuye Gu
José Lezama
Jonathan Huang
...
Irfan Essa
Huisheng Wang
David A. Ross
Bryan Seybold
Lu Jiang
VGen
68
249
0
21 Dec 2023
Sequential Modeling Enables Scalable Learning for Large Vision Models
Sequential Modeling Enables Scalable Learning for Large Vision Models
Yutong Bai
Xinyang Geng
K. Mangalam
Amir Bar
Alan Yuille
Trevor Darrell
Jitendra Malik
Alexei A. Efros
MLLM
VLM
49
158
0
01 Dec 2023
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large
  Datasets
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
A. Blattmann
Tim Dockhorn
Sumith Kulal
Daniel Mendelevitch
Maciej Kilian
...
Zion English
Vikram S. Voleti
Adam Letts
Varun Jampani
Robin Rombach
VGen
205
1,082
0
25 Nov 2023
SEINE: Short-to-Long Video Diffusion Model for Generative Transition and
  Prediction
SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction
Xinyuan Chen
Yaohui Wang
Lingjun Zhang
Shaobin Zhuang
Xin Ma
Jiashuo Yu
Yali Wang
Dahua Lin
Yu Qiao
Ziwei Liu
VGen
DiffM
29
137
0
31 Oct 2023
A Picture is Worth a Thousand Words: Principled Recaptioning Improves
  Image Generation
A Picture is Worth a Thousand Words: Principled Recaptioning Improves Image Generation
Eyal Segalis
Dani Valevski
Danny Lumen
Yossi Matias
Yaniv Leviathan
DiffM
79
24
0
25 Oct 2023
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
Lijun Yu
José Lezama
N. B. Gundavarapu
Luca Versari
Kihyuk Sohn
...
Boqing Gong
Ming-Hsuan Yang
Irfan Essa
David A. Ross
Lu Jiang
70
297
0
09 Oct 2023
PixArt-$α$: Fast Training of Diffusion Transformer for
  Photorealistic Text-to-Image Synthesis
PixArt-ααα: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis
Junsong Chen
Jincheng Yu
Chongjian Ge
Lewei Yao
Enze Xie
...
Zhongdao Wang
James T. Kwok
Ping Luo
Huchuan Lu
Zhenguo Li
DiffM
70
414
0
30 Sep 2023
VideoControlNet: A Motion-Guided Video-to-Video Translation Framework by
  Using Diffusion Model with ControlNet
VideoControlNet: A Motion-Guided Video-to-Video Translation Framework by Using Diffusion Model with ControlNet
Zhihao Hu
Dong Xu
DiffM
VGen
64
65
0
26 Jul 2023
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding
  and Generation
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation
Yi Wang
Yinan He
Yizhuo Li
Kunchang Li
Jiashuo Yu
...
Ping Luo
Ziwei Liu
Yali Wang
Limin Wang
Yu Qiao
VLM
VGen
71
255
0
13 Jul 2023
SDXL: Improving Latent Diffusion Models for High-Resolution Image
  Synthesis
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
Dustin Podell
Zion English
Kyle Lacey
A. Blattmann
Tim Dockhorn
Jonas Muller
Joe Penna
Robin Rombach
162
2,242
0
04 Jul 2023
VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking
VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking
Limin Wang
Bingkun Huang
Zhiyu Zhao
Zhan Tong
Yinan He
Yi Wang
Yali Wang
Yu Qiao
VGen
80
341
0
29 Mar 2023
VideoFusion: Decomposed Diffusion Models for High-Quality Video
  Generation
VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation
Zhengxiong Luo
Dayou Chen
Yingya Zhang
Yan Huang
Liangsheng Wang
Yujun Shen
Deli Zhao
Jinren Zhou
Tien-Ping Tan
DiffM
VGen
157
314
0
15 Mar 2023
Scalable Diffusion Models with Transformers
Scalable Diffusion Models with Transformers
William S. Peebles
Saining Xie
GNN
60
2,182
0
19 Dec 2022
Flow Matching for Generative Modeling
Flow Matching for Generative Modeling
Y. Lipman
Ricky T. Q. Chen
Heli Ben-Hamu
Maximilian Nickel
Matt Le
OOD
103
1,189
0
06 Oct 2022
Make-A-Video: Text-to-Video Generation without Text-Video Data
Make-A-Video: Text-to-Video Generation without Text-Video Data
Uriel Singer
Adam Polyak
Thomas Hayes
Xiaoyue Yin
Jie An
...
Oron Ashual
Oran Gafni
Devi Parikh
Sonal Gupta
Yaniv Taigman
DiffM
VGen
64
1,373
0
29 Sep 2022
Classifier-Free Diffusion Guidance
Classifier-Free Diffusion Guidance
Jonathan Ho
Tim Salimans
FaML
75
3,786
0
26 Jul 2022
CogVideo: Large-scale Pretraining for Text-to-Video Generation via
  Transformers
CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers
Wenyi Hong
Ming Ding
Wendi Zheng
Xinghan Liu
Jie Tang
DiffM
277
585
0
29 May 2022
Photorealistic Text-to-Image Diffusion Models with Deep Language
  Understanding
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
Chitwan Saharia
William Chan
Saurabh Saxena
Lala Li
Jay Whang
...
Raphael Gontijo-Lopes
Tim Salimans
Jonathan Ho
David J Fleet
Mohammad Norouzi
VLM
265
5,904
0
23 May 2022
Hierarchical Text-Conditional Image Generation with CLIP Latents
Hierarchical Text-Conditional Image Generation with CLIP Latents
Aditya A. Ramesh
Prafulla Dhariwal
Alex Nichol
Casey Chu
Mark Chen
VLM
DiffM
276
6,768
0
13 Apr 2022
Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive
  Transformer
Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer
Songwei Ge
Thomas Hayes
Harry Yang
Xiaoyue Yin
Guan Pang
David Jacobs
Jia-Bin Huang
Devi Parikh
ViT
72
215
0
07 Apr 2022
Video Diffusion Models
Video Diffusion Models
Jonathan Ho
Tim Salimans
Alexey A. Gritsenko
William Chan
Mohammad Norouzi
David J. Fleet
DiffM
VGen
138
1,563
0
07 Apr 2022
12
Next