Video Understanding with Large Language Models: A Survey

29 December 2023
Yunlong Tang
Jing Bi
Siting Xu
Luchuan Song
Susan Liang
Teng Wang
Daoan Zhang
Jie An
Jingyang Lin
Rongyi Zhu
Ali Vosoughi
Chao Huang
Zeliang Zhang
Pinxin Liu
Mingqian Feng
Feng Zheng
Jianguo Zhang
Jiebo Luo
Chenliang Xu
    VLM
ArXiv (abs) | PDF | HTML | HuggingFace (3 upvotes) | GitHub (2325★)

Papers citing "Video Understanding with Large Language Models: A Survey"

50 / 105 papers shown
VERIFY: A Benchmark of Visual Explanation and Reasoning for Investigating Multimodal Reasoning Fidelity
Jing Bi
Junjia Guo
Susan Liang
Guangyu Sun
Luchuan Song
...
Jinxi He
Jiarui Wu
Ali Vosoughi
Chong Chen
Chenliang Xu
LRM
210
17
0
14 Mar 2025
Quality Over Quantity? LLM-Based Curation for a Data-Efficient Audio-Video Foundation Model
Ali Vosoughi
Dimitra Emmanouilidou
H. Gamper
460
2
0
12 Mar 2025
ComicsPAP: understanding comic strips by picking the correct panel
IEEE International Conference on Document Analysis and Recognition (ICDAR), 2025
Emanuele Vivoli
Artemis LLabres
Mohamed Ali Soubgui
Marco Bertini
Ernest Valveny Llobet
Dimosthenis Karatzas
457
2
0
11 Mar 2025
UrbanVideo-Bench: Benchmarking Vision-Language Models on Embodied Intelligence with Video Data in Urban Spaces
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Baining Zhao
Jianjie Fang
Zichao Dai
Liang Luo
Jirong Zha
...
Chen Gao
Yijiao Wang
Jinqiang Cui
Xinlei Chen
Yongqian Li
338
21
0
08 Mar 2025
CalibQuant: 1-Bit KV Cache Quantization for Multimodal LLMs
Zeliang Zhang
Yifan Zhu
Susan Liang
Zhiyuan Wang
Jiani Liu
...
Mingjie Zhao
Chenliang Xu
Kun Wan
Wentian Zhao
VLM, MQ
372
0
0
15 Feb 2025
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding
Computer Vision and Pattern Recognition (CVPR), 2025
Yilun Zhao
Lujing Xie
Haowei Zhang
Guo Gan
Yitao Long
...
Xiangru Tang
Zhenwen Liang
Yongxu Liu
Chen Zhao
Arman Cohan
287
67
0
21 Jan 2025
OneLLM: One Framework to Align All Modalities with Language
Computer Vision and Pattern Recognition (CVPR), 2023
Jiaming Han
Kaixiong Gong
Yiyuan Zhang
Yuan Liu
Kaipeng Zhang
Dahua Lin
Yu Qiao
Shiyang Feng
Xiangyu Yue
MLLM
553
194
0
10 Jan 2025
Generative AI for Cel-Animation: A Survey
Yunlong Tang
Junjia Guo
Pinxin Liu
Zhiyuan Wang
Hang Hua
...
Jing Bi
Mingqian Feng
Xuzhao Li
Zeliang Zhang
Chenliang Xu
VGen
695
17
0
08 Jan 2025
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
Computer Vision and Pattern Recognition (CVPR), 2024
Yuqian Yuan
Hang Zhang
Wentong Li
Zesen Cheng
Boqiang Zhang
...
Deli Zhao
Wenqiao Zhang
Yueting Zhuang
Jianke Zhu
Lidong Bing
418
39
0
31 Dec 2024
Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
Neural Information Processing Systems (NeurIPS), 2024
Hao Fei
Shengqiong Wu
Hao Zhang
Tat-Seng Chua
Shuicheng Yan
478
74
0
31 Dec 2024
When SAM2 Meets Video Shadow and Mirror Detection
Leiping Jie
VLM
216
1
0
26 Dec 2024
Do Language Models Understand Time?
The Web Conference (WWW), 2024
Xi Ding
Lei Wang
919
10
0
18 Dec 2024
VisionZip: Longer is Better but Not Necessary in Vision Language Models
Computer Vision and Pattern Recognition (CVPR), 2024
Senqiao Yang
Yukang Chen
Zhuotao Tian
Chengyao Wang
Jingyao Li
Bei Yu
Jiaya Jia
VLM
281
105
0
05 Dec 2024
Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding
Yiming Zhang
Zhuokai Zhao
Zhaorun Chen
Zenghui Ding
Xianjun Yang
Yining Sun
1.1K
9
0
21 Nov 2024
VidComposition: Can MLLMs Analyze Compositions in Compiled Videos?
Computer Vision and Pattern Recognition (CVPR), 2024
Yunlong Tang
Junjia Guo
Hang Hua
Susan Liang
Mingqian Feng
...
Chao Huang
Jing Bi
Zeliang Zhang
Pooyan Fazli
Chenliang Xu
CoGe
408
16
0
17 Nov 2024
VideoCogQA: A Controllable Benchmark for Evaluating Cognitive Abilities in Video-Language Models
Chenglin Li
Qianglong Chen
Zhi Li
Feng Tao
Yin Zhang
422
0
0
14 Nov 2024
EVQAScore: A Fine-grained Metric for Video Question Answering Data Quality Evaluation
Hao Liang
Zirong Chen
Feiyu Xiong
Wentao Zhang
309
0
0
11 Nov 2024
Making Every Frame Matter: Continuous Activity Recognition in Streaming Video via Adaptive Video Context Modeling
Hao Wu
Donglin Bai
Shiqi Jiang
Qianxi Zhang
Yue Yang
Ting Cao
Yunxin Liu
Fengyuan Xu
588
0
0
19 Oct 2024
VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI
Sijie Cheng
Kechen Fang
Yangyang Yu
Sicheng Zhou
Yangqiu Song
Ye Tian
Tingguang Li
Lei Han
Yang Liu
254
17
0
15 Oct 2024
Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs
Kai Han
Jianyuan Guo
Yehui Tang
W. He
Enhua Wu
Yunhe Wang
MLLM, VLM
200
16
0
14 Oct 2024
MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models
Hang Hua
Yunlong Tang
Ziyun Zeng
Liangliang Cao
Zhengyuan Yang
Hangfeng He
Chenliang Xu
Jiebo Luo
VLM, CoGe
232
22
0
13 Oct 2024
G$^{2}$TR: Generalized Grounded Temporal Reasoning for Robot Instruction Following by Combining Large Pre-trained Models
Riya Arora
N. N.
Aman Tambi
Sandeep S. Zachariah
Souvik Chakraborty
Rohan Paul
LM&Ro
183
0
0
10 Oct 2024
Temporal Reasoning Transfer from Text to Video
International Conference on Learning Representations (ICLR), 2024
Lei Li
Yuanxin Liu
Linli Yao
Peiyuan Zhang
Chenxin An
Lean Wang
Xu Sun
Dianbo Sui
Qi Liu
LRM
179
20
0
08 Oct 2024
Enhancing Temporal Modeling of Video LLMs via Time Gating
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Zi-Yuan Hu
Yiwu Zhong
Shijia Huang
Michael R. Lyu
Liwei Wang
VLM
188
7
0
08 Oct 2024
On Efficient Variants of Segment Anything Model: A Survey
International Journal of Computer Vision (IJCV), 2024
Xiaorui Sun
Jing Liu
Mengqi Li
Xiaofeng Zhu
Ping Hu
VLM
505
18
0
07 Oct 2024
UAL-Bench: The First Comprehensive Unusual Activity Localization Benchmark
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2024
Hasnat Md Abdullah
Tian Liu
Kangda Wei
Shu Kong
Ruihong Huang
255
5
0
02 Oct 2024
From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding
Heqing Zou
Tianze Luo
Guiyang Xie
Victor Zhang
...
Guangcong Wang
Juanyang Chen
Zhuochen Wang
Hansheng Zhang
Huaijian Zhang
VLM
293
19
0
27 Sep 2024
EAGLE: Egocentric AGgregated Language-video Engine
ACM Multimedia (MM), 2024
Jing Bi
Yunlong Tang
Luchuan Song
Ali Vosoughi
Nguyen Nguyen
Chenliang Xu
222
16
0
26 Sep 2024
Multi-modal Generative AI: Multi-modal LLMs, Diffusions, and the Unification
X. Wang
Yuwei Zhou
Bin Huang
Hong Chen
Wenwu Zhu
DiffM
490
9
0
23 Sep 2024
Surveying the MLLM Landscape: A Meta-Review of Current Surveys
Ming Li
Keyu Chen
Ziqian Bi
Ming Liu
Xinyuan Song
...
Jinlang Wang
Sen Zhang
Xuanhe Pan
Jiawei Xu
Pohsun Feng
OffRL
275
11
0
17 Sep 2024
HAVANA: Hierarchical stochastic neighbor embedding for Accelerated Video ANnotAtions
Alexandru Bobe
Jan van Gemert
197
0
0
16 Sep 2024
VideoQA in the Era of LLMs: An Empirical Study
International Journal of Computer Vision (IJCV), 2024
Junbin Xiao
Nanxin Huang
Hangyu Qin
Dongyang Li
Yicong Li
...
Zhulin Tao
Jianxing Yu
Liang Lin
Tat-Seng Chua
Angela Yao
344
24
0
08 Aug 2024
CoMMIT: Coordinated Multimodal Instruction Tuning
Junda Wu
Xintong Li
Tong Yu
Yu Wang
Xiang Chen
Jiuxiang Gu
Lina Yao
Julian McAuley
Jingbo Shang
164
4
0
29 Jul 2024
The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective
Zhen Qin
Daoyuan Chen
Wenhao Zhang
Liuyi Yao
Yilun Huang
Bolin Ding
Yaliang Li
Shuiguang Deng
347
11
0
11 Jul 2024
OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer
Lu Zhang
Tiancheng Zhao
Heting Ying
Yibo Ma
Kyusong Lee
LLMAG
272
25
0
24 Jun 2024
Towards Event-oriented Long Video Understanding
Yifan Du
Kun Zhou
Yuqi Huo
Yifan Li
Wayne Xin Zhao
Haoyu Lu
Zijia Zhao
Bingning Wang
Weipeng Chen
Ji-Rong Wen
VLM
201
19
0
20 Jun 2024
VideoVista: A Versatile Benchmark for Video Understanding and Reasoning
Yunxin Li
Xinyu Chen
Baotian Hu
Longyue Wang
Haoyuan Shi
Min Zhang
MLLM, LRM
388
58
0
17 Jun 2024
Beyond Raw Videos: Understanding Edited Videos with Large Multimodal Model
Lu Xu
Sijie Zhu
Chunyuan Li
Chia-Wen Kuo
Fan Chen
Xinyao Wang
Guang Chen
Dawei Du
Ye Yuan
Longyin Wen
251
12
0
15 Jun 2024
Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA
Jongwoo Park
Kanchana Ranasinghe
Kumara Kahatapitiya
Wonjeong Ryoo
Donghyun Kim
Michael S. Ryoo
369
59
0
13 Jun 2024
Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Thong Nguyen
Yi Bin
Junbin Xiao
Leigang Qu
Yicong Li
Jay Zhangjie Wu
Cong-Duy Nguyen
See-Kiong Ng
Luu Anh Tuan
VLM
570
26
1
09 Jun 2024
VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos
Ziyang Wang
Shoubin Yu
Elias Stengel-Eskin
Jaehong Yoon
Feng Cheng
Gedas Bertasius
Mohit Bansal
472
147
0
29 May 2024
A Survey of Multimodal Large Language Model from A Data-centric Perspective
Tianyi Bai
Hao Liang
Binwang Wan
Yanran Xu
Xi Li
...
Ping Huang
Jiulong Shan
Conghui He
Binhang Yuan
Wentao Zhang
373
64
0
26 May 2024
Graphic Design with Large Multimodal Model
Yutao Cheng
Zhao Zhang
Maoke Yang
Hui Nie
Chunyuan Li
Xinglong Wu
Jie Shao
327
26
0
22 Apr 2024
V2Xum-LLM: Cross-Modal Video Summarization with Temporal Prompt Instruction Tuning
Hang Hua
Yunlong Tang
Chenliang Xu
Jiebo Luo
VGen
416
47
0
18 Apr 2024
From Image to Video, what do we need in multimodal LLMs?
Suyuan Huang
Haoxin Zhang
Yan Gao
Honggu Chen
Yao Hu
Zhan Qin
VLM
288
12
0
18 Apr 2024
MoReVQA: Exploring Modular Reasoning Models for Video Question Answering
Juhong Min
Shyamal Buch
Arsha Nagrani
Minsu Cho
Cordelia Schmid
LRM
416
62
0
09 Apr 2024
Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding
AAAI Conference on Artificial Intelligence (AAAI), 2024
Yunlong Tang
Daiki Shimada
Jing Bi
Hang Hua
Chenliang Xu
VGen
378
23
0
24 Mar 2024
VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding
Ahmad A Mahmood
Ashmal Vayani
Muzammal Naseer
Salman Khan
Fahad Shahbaz Khan
LRM
419
11
0
21 Mar 2024
Contextual AD Narration with Interleaved Multimodal Sequence
Computer Vision and Pattern Recognition (CVPR), 2024
Hanlin Wang
Zhan Tong
Kecheng Zheng
Yujun Shen
Limin Wang
VGen
472
7
0
19 Mar 2024
DreamFrame: Enhancing Video Understanding via Automatically Generated QA and Style-Consistent Keyframes
Zhende Song
Chenchen Wang
Jiamu Sheng
C. Zhang
Gang Yu
Jiayuan Fan
Tao Chen
VGen
468
21
0
03 Mar 2024