Papers
Communities
Organizations
Events
Blog
Pricing
Feedback
Contact Sales
Search
Open menu
Home
Papers
2505.12605
Cited By
Temporal-Oriented Recipe for Transferring Large Vision-Language Model to Video Understanding
19 May 2025
Thong Nguyen
Zhiyuan Hu
Xu Lin
Cong-Duy Nguyen
See-Kiong Ng
Luu Anh Tuan
VLM
Re-assign community
ArXiv (abs)
PDF
HTML
Papers citing
"Temporal-Oriented Recipe for Transferring Large Vision-Language Model to Video Understanding"
44 / 44 papers shown
Title
Qwen2.5-VL Technical Report
S. Bai
Keqin Chen
Xuejing Liu
Jialin Wang
Wenbin Ge
...
Zesen Cheng
Hang Zhang
Zhibo Yang
Haiyang Xu
Junyang Lin
VLM
503
1,345
0
20 Feb 2025
HourVideo: 1-Hour Video-Language Understanding
Keshigeyan Chandrasegaran
Agrim Gupta
Lea M. Hadzic
Taran Kota
Jimming He
Cristobal Eyzaguirre
Zane Durante
Pengfei Yu
Jiajun Wu
L. Fei-Fei
VLM
153
62
0
07 Nov 2024
Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models
Haibo Wang
Zhiyang Xu
Yu Cheng
Shizhe Diao
Yufan Zhou
Yixin Cao
Qifan Wang
Weifeng Ge
Lifu Huang
131
40
0
04 Oct 2024
Meta-optimized Angular Margin Contrastive Framework for Video-Language Representation Learning
Thong Nguyen
Yi Bin
Xiaobao Wu
Xinshuai Dong
Zhiyuan Hu
Khoi M. Le
Cong-Duy Nguyen
See-Kiong Ng
Luu Anh Tuan
229
8
0
04 Jul 2024
Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives
Thong Nguyen
Yi Bin
Junbin Xiao
Leigang Qu
Yicong Li
Jay Zhangjie Wu
Cong-Duy Nguyen
See-Kiong Ng
Luu Anh Tuan
VLM
252
19
1
09 Jun 2024
Encoding and Controlling Global Semantics for Long-form Video Question Answering
Thong Nguyen
Zhiyuan Hu
Xiaobao Wu
Cong-Duy Nguyen
See-Kiong Ng
Anh Tuan Luu
120
5
0
30 May 2024
Streaming Long Video Understanding with Large Language Models
Rui Qian
Xiao-wen Dong
Pan Zhang
Yuhang Zang
Shuangrui Ding
Dahua Lin
Yuan Liu
VLM
163
69
0
25 May 2024
PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning
Lin Xu
Yilin Zhao
Daquan Zhou
Zhijie Lin
See Kiong Ng
Jiashi Feng
MLLM
VLM
145
225
0
25 Apr 2024
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
Bo He
Hengduo Li
Young Kyun Jang
Menglin Jia
Xuefei Cao
Ashish Shah
Abhinav Shrivastava
Ser-Nam Lim
MLLM
180
131
0
08 Apr 2024
ST-LLM: Large Language Models Are Effective Temporal Learners
Ruyang Liu
Chen Li
Haoran Tang
Yixiao Ge
Ying Shan
Ge Li
125
101
0
30 Mar 2024
Video ReCap: Recursive Captioning of Hour-Long Videos
Md. Mohaiminul Islam
Ngan Ho
Xitong Yang
Tushar Nagarajan
Lorenzo Torresani
Gedas Bertasius
VGen
VLM
236
67
0
20 Feb 2024
DemaFormer: Damped Exponential Moving Average Transformer with Energy-Based Modeling for Temporal Language Grounding
Thong Nguyen
Xiaobao Wu
Xinshuai Dong
Cong-Duy Nguyen
See-Kiong Ng
Anh Tuan Luu
128
9
0
05 Dec 2023
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
Yanwei Li
Chengyao Wang
Jiaya Jia
VLM
MLLM
167
374
0
28 Nov 2023
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
Kunchang Li
Yali Wang
Yinan He
Yizhuo Li
Yi Wang
...
Jilan Xu
Guo Chen
Ping Luo
Limin Wang
Yu Qiao
VLM
MLLM
307
638
0
28 Nov 2023
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Bin Lin
Yang Ye
Bin Zhu
Jiaxi Cui
Munan Ning
Peng Jin
Li-ming Yuan
VLM
MLLM
515
896
0
16 Nov 2023
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
Peng Jin
Ryuichi Takanobu
Caiwan Zhang
Xiaochun Cao
Li-ming Yuan
MLLM
240
295
0
14 Nov 2023
Dense Video Captioning: A Survey of Techniques, Datasets and Evaluation Protocols
Iqra Qasim
Alexander Horsch
Dilip K. Prasad
129
11
0
05 Nov 2023
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
Bin Zhu
Bin Lin
Munan Ning
Yang Yan
Jiaxi Cui
...
Zongwei Li
Wancai Zhang
Zhifeng Li
Wei Liu
Liejie Yuan
VLM
MLLM
331
270
0
03 Oct 2023
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation
Yi Wang
Yinan He
Yizhuo Li
Kunchang Li
Jiashuo Yu
...
Ping Luo
Ziwei Liu
Yali Wang
Limin Wang
Yu Qiao
VLM
VGen
188
319
0
13 Jul 2023
Valley: Video Assistant with Large Language model Enhanced abilitY
Ruipu Luo
Ziwang Zhao
Min Yang
Junwei Dong
Da Li
Pengcheng Lu
Tao Wang
Linmei Hu
Ming-Hui Qiu
MLLM
217
224
0
12 Jun 2023
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
Muhammad Maaz
H. Rasheed
Salman Khan
Fahad Shahbaz Khan
MLLM
233
778
0
08 Jun 2023
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
Hang Zhang
Xin Li
Lidong Bing
MLLM
335
1,238
0
05 Jun 2023
VideoLLM: Modeling Video Sequence with Large Language Models
Guo Chen
Yin-Dong Zheng
Jiahao Wang
Jilan Xu
Yifei Huang
...
Yi Wang
Yali Wang
Yu Qiao
Tong Lu
Limin Wang
MLLM
201
97
0
22 May 2023
VideoChat: Chat-Centric Video Understanding
Kunchang Li
Yinan He
Yi Wang
Yizhuo Li
Wen Wang
Ping Luo
Yali Wang
Limin Wang
Yu Qiao
MLLM
244
663
0
10 May 2023
A Review of Deep Learning for Video Captioning
Moloud Abdar
Meenakshi Kollati
Swaraja Kuraparthi
Farhad Pourpanah
Daniel J. McDuff
...
Shuicheng Yan
Abduallah A. Mohamed
Abbas Khosravi
Min Zhang
Fatih Porikli
3DV
145
26
0
22 Apr 2023
Visual Instruction Tuning
Haotian Liu
Chunyuan Li
Qingyang Wu
Yong Jae Lee
SyDa
VLM
MLLM
632
5,875
0
17 Apr 2023
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning
Antoine Yang
Arsha Nagrani
Paul Hongsuck Seo
Antoine Miech
Jordi Pont-Tuset
Ivan Laptev
Josef Sivic
Cordelia Schmid
AI4TS
VLM
223
270
0
27 Feb 2023
Grounding Language Models to Images for Multimodal Inputs and Outputs
Jing Yu Koh
Ruslan Salakhutdinov
Daniel Fried
MLLM
214
137
0
31 Jan 2023
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
VLM
MLLM
797
5,472
0
30 Jan 2023
VindLU: A Recipe for Effective Video-and-Language Pretraining
Feng Cheng
Xizi Wang
Jie Lei
David J. Crandall
Joey Tianyi Zhou
Gedas Bertasius
VLM
154
87
0
09 Dec 2022
EVA: Exploring the Limits of Masked Visual Representation Learning at Scale
Yuxin Fang
Wen Wang
Binhui Xie
Quan-Sen Sun
Ledell Yu Wu
Xinggang Wang
Tiejun Huang
Xinlong Wang
Yue Cao
VLM
CLIP
356
816
0
14 Nov 2022
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling
Tsu-Jui Fu
Linjie Li
Zhe Gan
Kevin Qinghong Lin
William Yang Wang
Lijuan Wang
Zicheng Liu
VLM
246
74
0
04 Sep 2022
Revealing Single Frame Bias for Video-and-Language Learning
Jie Lei
Tamara L. Berg
Joey Tianyi Zhou
116
127
0
07 Jun 2022
Revisiting the "Video" in Video-Language Understanding
S. Buch
Cristobal Eyzaguirre
Adrien Gaidon
Jiajun Wu
L. Fei-Fei
Juan Carlos Niebles
125
179
0
03 Jun 2022
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Junnan Li
Dongxu Li
Caiming Xiong
Guosheng Lin
MLLM
BDL
VLM
CLIP
725
4,920
0
28 Jan 2022
An Empirical Study of Training End-to-End Vision-and-Language Transformers
Zi-Yi Dou
Yichong Xu
Zhe Gan
Jianfeng Wang
Shuohang Wang
...
Pengchuan Zhang
Lu Yuan
Nanyun Peng
Zicheng Liu
Michael Zeng
VLM
169
397
0
03 Nov 2021
Towards Long-Form Video Understanding
Chaoxia Wu
Philipp Krahenbuhl
VLM
ViT
149
183
0
21 Jun 2021
NExT-QA:Next Phase of Question-Answering to Explaining Temporal Actions
Junbin Xiao
Xindi Shang
Angela Yao
Tat-Seng Chua
262
595
0
18 May 2021
Video Summarization Using Deep Neural Networks: A Survey
Evlampios Apostolidis
E. Adamantidou
Alexandros I. Metsai
Vasileios Mezaris
Ioannis Patras
AI4TS
214
230
0
15 Jan 2021
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy
Lucas Beyer
Alexander Kolesnikov
Dirk Weissenborn
Xiaohua Zhai
...
Matthias Minderer
G. Heigold
Sylvain Gelly
Jakob Uszkoreit
N. Houlsby
ViT
1.0K
46,199
0
22 Oct 2020
COIN: A Large-scale Dataset for Comprehensive Instructional Video Analysis
Yansong Tang
Dajun Ding
Yongming Rao
Yu Zheng
Danyang Zhang
Lili Zhao
Jiwen Lu
Jie Zhou
190
338
0
07 Mar 2019
Dense-Captioning Events in Videos
Ranjay Krishna
Kenji Hata
F. Ren
Li Fei-Fei
Juan Carlos Niebles
320
1,326
0
02 May 2017
Towards Automatic Learning of Procedures from Web Instructional Videos
Luowei Zhou
Chenliang Xu
Jason J. Corso
EgoV
229
889
0
28 Mar 2017
Deep Learning for Video Classification and Captioning
Zuxuan Wu
Ting Yao
Yanwei Fu
Yu-Gang Jiang
3DV
VLM
93
130
0
22 Sep 2016
1