MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens

4 April 2024

Papers citing "MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens"

DEAL-300K: Diffusion-based Editing Area Localization with a 300K-Scale Dataset and Frequency-Prompted Baseline

126

28 Nov 2025

Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance?

Apratim Bhattacharyya

142

27 Nov 2025

MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping

307

19 Nov 2025

LiveStar: Live Streaming Assistant for Real-World Online Video Understanding

298

07 Nov 2025

HouseTour: A Virtual Real Estate A(I)gent

283

20 Oct 2025

K-frames: Scene-Driven Any-k Keyframe Selection for long video understanding

190

14 Oct 2025

VC-Agent: An Interactive Agent for Customized Video Dataset Collection

202

25 Sep 2025

Enhancing Video Large Language Models with Structured Multi-Video Collaborative Reasoning

285

16 Sep 2025

LLM-Guided Semantic Relational Reasoning for Multimodal Intent Recognition

134

01 Sep 2025

Video-LevelGauge: Investigating Contextual Positional Bias in Large Video Language Models

238

27 Aug 2025

RynnEC: Bringing MLLMs into Embodied World

242

19 Aug 2025

JRDB-Reasoning: A Difficulty-Graded Benchmark for Visual Reasoning in Robotics

347

14 Aug 2025

Describe What You See with Multimodal Large Language Models to Enhance Video Recommendations

ACM Conference on Recommender Systems (RecSys), 2025

159

13 Aug 2025

KFFocus: Highlighting Keyframes for Enhanced Video Understanding

188

12 Aug 2025

Vision Generalist Model: A Survey

International Journal of Computer Vision (IJCV), 2025

...

318

11 Jun 2025

Time Blindness: Why Video-Language Models Can't See What Humans Can?

245

30 May 2025

VidText: Towards Comprehensive Evaluation for Video Text Understanding

...

378

28 May 2025

HoliTom: Holistic Token Merging for Fast Video Large Language Models

755

27 May 2025

Domain Adaptation of VLM for Soccer Video Understanding

400

20 May 2025

EmoSign: A Multimodal Dataset for Understanding Emotions in American Sign Language

405

20 May 2025

Hearing from Silence: Reasoning Audio Descriptions from Silent Videos via Vision-Language Model

511

19 May 2025

TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos

...

314

24 Apr 2025

LVC: A Lightweight Compression Framework for Enhancing VLMs in Long Video Understanding

318

09 Apr 2025

BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding

Computer Vision and Pattern Recognition (CVPR), 2025

369

27 Mar 2025

From Trial to Triumph: Advancing Long Video Understanding via Visual Context Sample Scaling and Self-reward Alignment

327

26 Mar 2025

Plug-and-Play 1.x-Bit KV Cache Quantization for Video Large Language Models

422

20 Mar 2025

Crab: A Unified Audio-Visual Scene Understanding Model with Explicit Cooperation

Computer Vision and Pattern Recognition (CVPR), 2025

295

17 Mar 2025

Memory-enhanced Retrieval Augmentation for Long Video Understanding

442

12 Mar 2025

FuseChat-3.0: Preference Optimization Meets Heterogeneous Model Fusion

311

06 Mar 2025

SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding

International Conference on Learning Representations (ICLR), 2025

421

15 Feb 2025

When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search

Neural Information Processing Systems (NeurIPS), 2024

458

28 Jan 2025

Visual Large Language Models for Generalized and Specialized Applications

499

06 Jan 2025

GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models

845

02 Jan 2025

VidCtx: Context-aware Video Question Answering with Image Models

Andreas Goulas

Vasileios Mezaris

Ioannis Patras

1.0K

23 Dec 2024

Do Language Models Understand Time?

The Web Conference (WWW), 2024

Xi Ding

Lei Wang

980

18 Dec 2024

VideoOrion: Tokenizing Object Dynamics in Videos

Sipeng Zheng

Zongqing Lu

437

25 Nov 2024

MLAN: Language-Based Instruction Tuning Preserves and Transfers Knowledge in Multimodal Language Models

...

413

15 Nov 2024

Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms

International Conference on Learning Representations (ICLR), 2024

Mohana Prasad Sathya Moorthy

523

24 Oct 2024

xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs

...

347

21 Oct 2024

EMO-LLaMA: Enhancing Facial Emotion Understanding with Instruction Tuning

Xin Liu

Jingyu Yang

238

21 Aug 2024

Animate3D: Animating Any 3D Model with Multi-view Video Diffusion

Fan Wang

264

16 Jul 2024

Tarsier: Recipes for Training and Evaluating Large Video Description Models

Jiawei Wang

Liping Yuan

Yuchen Zhang

332

129

30 Jun 2024

InfiniBench: A Benchmark for Large Multi-Modal Models in Long-Form Movies and TV Shows

370

28 Jun 2024

VANE-Bench: Video Anomaly Evaluation Benchmark for Conversational LMMs

North American Chapter of the Association for Computational Linguistics (NAACL), 2024

Salman Khan

437

14 Jun 2024

ShareGPT4Video: Improving Video Understanding and Generation with Better Captions

Neural Information Processing Systems (NeurIPS), 2024

Lin Chen

Xilin Wei

Jinsong Li

Xiaoyi Dong

Pan Zhang

...

Li Yuan

Yu Qiao

Dahua Lin

Feng Zhao

Jiaqi Wang

421

371

06 Jun 2024

Video Understanding with Large Language Models: A Survey

...

860

202

29 Dec 2023