
Video Understanding with Large Language Models: A Survey

29 December 2023


Papers citing "Video Understanding with Large Language Models: A Survey"

50 / 105 papers shown

VERIFY: A Benchmark of Visual Explanation and Reasoning for Investigating Multimodal Reasoning Fidelity

...

210

14 Mar 2025

Quality Over Quantity? LLM-Based Curation for a Data-Efficient Audio-Video Foundation Model

Ali Vosoughi

Dimitra Emmanouilidou

H. Gamper

460

12 Mar 2025

ComicsPAP: understanding comic strips by picking the correct panel

IEEE International Conference on Document Analysis and Recognition (ICDAR), 2025

Ernest Valveny Llobet

Dimosthenis Karatzas

457

11 Mar 2025

UrbanVideo-Bench: Benchmarking Vision-Language Models on Embodied Intelligence with Video Data in Urban Spaces

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

...

338

08 Mar 2025

CalibQuant: 1-Bit KV Cache Quantization for Multimodal LLMs

...

372

15 Feb 2025

MMVU: Measuring Expert-Level Multi-Discipline Video Understanding

Computer Vision and Pattern Recognition (CVPR), 2025

...

287

21 Jan 2025

OneLLM: One Framework to Align All Modalities with Language

Computer Vision and Pattern Recognition (CVPR), 2023

553

194

10 Jan 2025

Generative AI for Cel-Animation: A Survey

...

695

08 Jan 2025

VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM

Computer Vision and Pattern Recognition (CVPR), 2024

...

418

31 Dec 2024

Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing

Neural Information Processing Systems (NeurIPS), 2024

478

31 Dec 2024

When SAM2 Meets Video Shadow and Mirror Detection

Leiping Jie

VLM

216

26 Dec 2024

Do Language Models Understand Time?

The Web Conference (WWW), 2024

Xi Ding

Lei Wang

919

18 Dec 2024

VisionZip: Longer is Better but Not Necessary in Vision Language Models

Computer Vision and Pattern Recognition (CVPR), 2024

281

105

05 Dec 2024

Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding

1.1K

21 Nov 2024

VidComposition: Can MLLMs Analyze Compositions in Compiled Videos?

Computer Vision and Pattern Recognition (CVPR), 2024

...

408

17 Nov 2024

VideoCogQA: A Controllable Benchmark for Evaluating Cognitive Abilities in Video-Language Models

422

14 Nov 2024

EVQAScore: A Fine-grained Metric for Video Question Answering Data Quality Evaluation

Hao Liang

Zirong Chen

Feiyu Xiong

Wentao Zhang

309

11 Nov 2024

Making Every Frame Matter: Continuous Activity Recognition in Streaming Video via Adaptive Video Context Modeling

Hao Wu

Yunxin Liu

Fengyuan Xu

588

19 Oct 2024

VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI

254

15 Oct 2024

Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs

Yunhe Wang

200

14 Oct 2024

MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models

Chenliang Xu

232

13 Oct 2024

$G^{2}$TR: Generalized Grounded Temporal Reasoning for Robot Instruction Following by Combining Large Pre-trained Models

183

10 Oct 2024

Temporal Reasoning Transfer from Text to Video

International Conference on Learning Representations (ICLR), 2024

Lei Li

Chenxin An

Xu Sun

Qi Liu

179

08 Oct 2024

Enhancing Temporal Modeling of Video LLMs via Time Gating

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

Liwei Wang

188

08 Oct 2024

On Efficient Variants of Segment Anything Model: A Survey

International Journal of Computer Vision (IJCV), 2024

505

07 Oct 2024

UAL-Bench: The First Comprehensive Unusual Activity Localization Benchmark

IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2024

255

02 Oct 2024

From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding

...

Huaijian Zhang

293

27 Sep 2024

EAGLE: Egocentric AGgregated Language-video Engine

ACM Multimedia (MM), 2024

222

26 Sep 2024

Multi-modal Generative AI: Multi-modal LLMs, Diffusions, and the Unification

490

23 Sep 2024

Surveying the MLLM Landscape: A Meta-Review of Current Surveys

Ming Li

Keyu Chen

Ziqian Bi

Ming Liu

Xinyuan Song

...

Jinlang Wang

Sen Zhang

Xuanhe Pan

Jiawei Xu

Pohsun Feng

OffRL

275

17 Sep 2024

HAVANA: Hierarchical stochastic neighbor embedding for Accelerated Video ANnotAtions

Alexandru Bobe

Jan van Gemert

197

16 Sep 2024

VideoQA in the Era of LLMs: An Empirical Study

International Journal of Computer Vision (IJCV), 2024

...

344

08 Aug 2024

CoMMIT: Coordinated Multimodal Instruction Tuning

164

29 Jul 2024

The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective

Bolin Ding

Yaliang Li

Shuiguang Deng

347

11 Jul 2024

OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer

272

24 Jun 2024

Towards Event-oriented Long Video Understanding

Kun Zhou

Wayne Xin Zhao

Bingning Wang

Weipeng Chen

Ji-Rong Wen

VLM

201

20 Jun 2024

VideoVista: A Versatile Benchmark for Video Understanding and Reasoning

Yunxin Li

388

17 Jun 2024

Beyond Raw Videos: Understanding Edited Videos with Large Multimodal Model

Xinyao Wang

Guang Chen

Dawei Du

Ye Yuan

Longyin Wen

251

15 Jun 2024

Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA

Jongwoo Park

369

13 Jun 2024

Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives

Annual Meeting of the Association for Computational Linguistics (ACL), 2024

570

09 Jun 2024

VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos

Elias Stengel-Eskin

Gedas Bertasius

Mohit Bansal

472

147

29 May 2024

A Survey of Multimodal Large Language Model from A Data-centric Perspective

...

Conghui He

373

26 May 2024

Graphic Design with Large Multimodal Model

327

22 Apr 2024

V2Xum-LLM: Cross-Modal Video Summarization with Temporal Prompt Instruction Tuning

416

18 Apr 2024

From Image to Video, what do we need in multimodal LLMs?

288

18 Apr 2024

MoReVQA: Exploring Modular Reasoning Models for Video Question Answering

416

09 Apr 2024

Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding

AAAI Conference on Artificial Intelligence (AAAI), 2024

378

24 Mar 2024

VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding

Ahmad A Mahmood

Ashmal Vayani

Muzammal Naseer

Salman Khan

Fahad Shahbaz Khan

LRM

419

21 Mar 2024

Contextual AD Narration with Interleaved Multimodal Sequence

Computer Vision and Pattern Recognition (CVPR), 2024

472

19 Mar 2024

DreamFrame: Enhancing Video Understanding via Automatically Generated QA and Style-Consistent Keyframes

468

03 Mar 2024