2404.03413
MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens
4 April 2024
Kirolos Ataallah
Xiaoqian Shen
Eslam Abdelrahman
Essam Sleiman
Deyao Zhu
Jian Ding
Mohamed Elhoseiny
VLM
HuggingFace (29 upvotes)
Papers citing
"MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens"
46 / 46 papers shown
DEAL-300K: Diffusion-based Editing Area Localization with a 300K-Scale Dataset and Frequency-Prompted Baseline
Rui Zhang
Hongxia Wang
Hangqing Liu
Yang Zhou
Q. Zeng
126
0
0
28 Nov 2025
Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance?
Apratim Bhattacharyya
Bicheng Xu
Sanjay Haresh
Reza Pourreza
Litian Liu
Sunny Panchal
Pulkit Madan
Leonid Sigal
Roland Memisevic
142
1
0
27 Nov 2025
MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping
Yushi Huang
Z. Wang
Zhihang Yuan
Yifu Ding
Ruihao Gong
Jinyang Guo
Xianglong Liu
Jun Zhang
MoE
VLM
307
2
0
19 Nov 2025
LiveStar: Live Streaming Assistant for Real-World Online Video Understanding
Zhenyu Yang
Kairui Zhang
Yuhang Hu
Bing Wang
Shengsheng Qian
Bin Wen
Fan Yang
Tingting Gao
Weiming Dong
Changsheng Xu
OffRL
AI4TS
VLM
298
5
0
07 Nov 2025
HouseTour: A Virtual Real Estate A(I)gent
Ata Çelen
Marc Pollefeys
Daniel Barath
Iro Armeni
VGen
283
3
0
20 Oct 2025
K-frames: Scene-Driven Any-k Keyframe Selection for long video understanding
Yifeng Yao
Yike Yun
Jing Wang
Huishuai Zhang
Dongyan Zhao
Ke Tian
Zhihao Wang
Minghui Qiu
Tao Wang
CLIP
VGen
190
6
0
14 Oct 2025
VC-Agent: An Interactive Agent for Customized Video Dataset Collection
Yidan Zhang
Mutian Xu
Yiming Hao
Kun Zhou
Jiahao Chang
Xiaoqiang Liu
Pengfei Wan
Hongbo Fu
Xiaoguang Han
VGen
202
1
0
25 Sep 2025
Enhancing Video Large Language Models with Structured Multi-Video Collaborative Reasoning
Zhihao He
Tianyao He
Yun Xu
Huabin Liu
Chaofan Gan
Gui Zou
W. Lin
285
3
0
16 Sep 2025
LLM-Guided Semantic Relational Reasoning for Multimodal Intent Recognition
Qianrui Zhou
Hua Xu
Yifan Wang
Xinzhi Dong
Hanlei Zhang
134
2
0
01 Sep 2025
Video-LevelGauge: Investigating Contextual Positional Bias in Large Video Language Models
Hou Xia
Zheren Fu
Fangcan Ling
Jiajun Li
Yi Tu
Zhendong Mao
Yongdong Zhang
238
0
0
27 Aug 2025
RynnEC: Bringing MLLMs into Embodied World
Ronghao Dang
Yuqian Yuan
Yunxuan Mao
Kehan Li
Jiangpin Liu
Zhikai Wang
Xin Li
F. Wang
Deli Zhao
VGen
LM&Ro
242
7
0
19 Aug 2025
JRDB-Reasoning: A Difficulty-Graded Benchmark for Visual Reasoning in Robotics
Simindokht Jahangard
Mehrzad Mohammadi
Yi Shen
Zhixi Cai
Hamid Rezatofighi
347
2
0
14 Aug 2025
Describe What You See with Multimodal Large Language Models to Enhance Video Recommendations
ACM Conference on Recommender Systems (RecSys), 2025
Marco De Nadai
Andreas Damianou
M. Lalmas
VLM
159
0
0
13 Aug 2025
KFFocus: Highlighting Keyframes for Enhanced Video Understanding
Ming-Jun Nie
Chunwei Wang
Hang Xu
Li Zhang
VGen
188
0
0
12 Aug 2025
Vision Generalist Model: A Survey
International Journal of Computer Vision (IJCV), 2025
Ziyi Wang
Yongming Rao
Shuofeng Sun
Xinrun Liu
Yi Wei
...
Zuyan Liu
Yanbo Wang
Hongmin Liu
Jie Zhou
Jiwen Lu
318
0
0
11 Jun 2025
Time Blindness: Why Video-Language Models Can't See What Humans Can?
Ujjwal Upadhyay
Mukul Ranjan
Zhiqiang Shen
Mohamed Elhoseiny
VLM
245
9
0
30 May 2025
VidText: Towards Comprehensive Evaluation for Video Text Understanding
Zhoufaran Yang
Yan Shu
Zhifei Yang
Yan Zhang
...
Gangyan Zeng
Yu Zhou
Andrii Zadaianchuk
Nicu Sebe
CoGe
378
5
0
28 May 2025
HoliTom: Holistic Token Merging for Fast Video Large Language Models
Kele Shao
Keda Tao
Can Qin
Haoxuan You
Yang Sui
Huan Wang
VLM
755
28
0
27 May 2025
Domain Adaptation of VLM for Soccer Video Understanding
Tiancheng Jiang
Henry Wang
Md Sirajus Salekin
Parmida Atighehchian
Shinan Zhang
VLM
400
4
0
20 May 2025
EmoSign: A Multimodal Dataset for Understanding Emotions in American Sign Language
Phoebe Chua
Cathy Mengying Fang
Takehiko Ohkawa
Raja Kushalnagar
Suranga Nanayakkara
Pattie Maes
SLR
405
3
0
20 May 2025
Hearing from Silence: Reasoning Audio Descriptions from Silent Videos via Vision-Language Model
Yong Ren
Chenxing Li
Le Xu
Hao Gu
Duzhen Zhang
Yujie Chen
Manjie Xu
Ruibo Fu
Shan Yang
Dong Yu
LRM
511
1
0
19 May 2025
TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos
Linli Yao
You Li
Y. X. Wei
Lei Li
Shuhuai Ren
...
Sida Li
Dianbo Sui
Qi Liu
Yanzhe Zhang
Xu Sun
314
35
0
24 Apr 2025
LVC: A Lightweight Compression Framework for Enhancing VLMs in Long Video Understanding
Ziyi Wang
Haoran Wu
Yiming Rong
Deyang Jiang
Yixin Zhang
Yue Zhao
Shuang Xu
Bo Xu
VLM
318
3
0
09 Apr 2025
BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding
Computer Vision and Pattern Recognition (CVPR), 2025
Shuming Liu
Chen Zhao
Tianqi Xu
Bernard Ghanem
VLM
369
34
0
27 Mar 2025
From Trial to Triumph: Advancing Long Video Understanding via Visual Context Sample Scaling and Self-reward Alignment
Yucheng Suo
Fan Ma
Linchao Zhu
T. Wang
Fengyun Rao
Yi Yang
LRM
327
5
0
26 Mar 2025
Plug-and-Play 1.x-Bit KV Cache Quantization for Video Large Language Models
Keda Tao
Haoxuan You
Yang Sui
Can Qin
Haoyu Wang
VLM
MQ
422
10
0
20 Mar 2025
Crab: A Unified Audio-Visual Scene Understanding Model with Explicit Cooperation
Computer Vision and Pattern Recognition (CVPR), 2025
Henghui Du
Guangyao Li
Chang Zhou
Chunjie Zhang
Alan Zhao
D. Hu
295
15
0
17 Mar 2025
Memory-enhanced Retrieval Augmentation for Long Video Understanding
Huaying Yuan
Zhengyang Liang
Minhao Qin
Hongjin Qian
Yan Shu
Zhicheng Dou
Ji-Rong Wen
Andrii Zadaianchuk
VOS
RALM
VLM
442
11
0
12 Mar 2025
FuseChat-3.0: Preference Optimization Meets Heterogeneous Model Fusion
Ziyi Yang
Fanqi Wan
Longguang Zhong
Canbin Huang
Guosheng Liang
Xiaojun Quan
MoMe
311
14
0
06 Mar 2025
SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding
International Conference on Learning Representations (ICLR), 2025
Zhenyu Yang
Yihan Hu
Zemin Du
Dizhan Xue
Chuanrui Hu
Jiahong Wu
Fan Yang
Weiming Dong
Changsheng Xu
421
36
0
15 Feb 2025
When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search
Neural Information Processing Systems (NeurIPS), 2024
Xuan Chen
Yuzhou Nie
Wenbo Guo
Xiangyu Zhang
458
49
0
28 Jan 2025
Visual Large Language Models for Generalized and Specialized Applications
Jiayi Zhang
Zhixin Lai
Wentao Bao
Zhen Tan
Anh Dao
Kewei Sui
Jiayi Shen
Dong Liu
Huan Liu
Yu Kong
VLM
499
35
0
06 Jan 2025
GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models
Zhangyang Qi
Zhixiong Zhang
Ye Fang
Yuan Liu
Hengshuang Zhao
845
61
0
02 Jan 2025
VidCtx: Context-aware Video Question Answering with Image Models
Andreas Goulas
Vasileios Mezaris
Ioannis Patras
1.0K
2
0
23 Dec 2024
Do Language Models Understand Time?
The Web Conference (WWW), 2024
Xi Ding
Lei Wang
980
12
0
18 Dec 2024
VideoOrion: Tokenizing Object Dynamics in Videos
Yicheng Feng
Yijiang Li
Wanpeng Zhang
Sipeng Zheng
Zongqing Lu
437
11
0
25 Nov 2024
MLAN: Language-Based Instruction Tuning Preserves and Transfers Knowledge in Multimodal Language Models
Jianhong Tu
Zhuohao Ni
Nicholas Crispino
Zihao Yu
Michael Bendersky
...
Ruoxi Jia
Xin Liu
Lingjuan Lyu
Dawn Song
Chenguang Wang
VLM
MLLM
413
0
0
15 Nov 2024
Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms
International Conference on Learning Representations (ICLR), 2024
Zhangheng Li
Keen You
Hao Zhang
Di Feng
Harsh Agrawal
Xiujun Li
Mohana Prasad Sathya Moorthy
Jeff Nichols
Yue Yang
Zhe Gan
MLLM
523
47
0
24 Oct 2024
xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs
Michael S Ryoo
Honglu Zhou
Shrikant B. Kendre
Can Qin
Le Xue
...
Kanchana Ranasinghe
Caiming Xiong
Ran Xu
Juan Carlos Niebles
VGen
347
29
0
21 Oct 2024
EMO-LLaMA: Enhancing Facial Emotion Understanding with Instruction Tuning
Bohao Xing
Zitong Yu
Xin Liu
Kaishen Yuan
Qilang Ye
Weicheng Xie
Huanjing Yue
Jingyu Yang
Heikki Kälviäinen
238
30
0
21 Aug 2024
Animate3D: Animating Any 3D Model with Multi-view Video Diffusion
Yanqin Jiang
Chaohui Yu
Chenjie Cao
Fan Wang
Weiming Hu
Jin Gao
VGen
264
44
0
16 Jul 2024
Tarsier: Recipes for Training and Evaluating Large Video Description Models
Jiawei Wang
Liping Yuan
Yuchen Zhang
332
129
0
30 Jun 2024
InfiniBench: A Benchmark for Large Multi-Modal Models in Long-Form Movies and TV Shows
Kirolos Ataallah
Eslam Abdelrahman
Mahmoud Ahmed
Chenhui Gou
Khushbu Pahwa
Jian Ding
Mohamed Elhoseiny
VLM
370
14
0
28 Jun 2024
VANE-Bench: Video Anomaly Evaluation Benchmark for Conversational LMMs
North American Chapter of the Association for Computational Linguistics (NAACL), 2024
Rohit K Bharadwaj
Hanan Gani
Muzammal Naseer
Fahad Shahbaz Khan
Salman Khan
437
17
0
14 Jun 2024
ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
Neural Information Processing Systems (NeurIPS), 2024
Lin Chen
Xilin Wei
Jinsong Li
Xiaoyi Dong
Pan Zhang
...
Li Yuan
Yu Qiao
Dahua Lin
Feng Zhao
Jiaqi Wang
421
371
0
06 Jun 2024
Video Understanding with Large Language Models: A Survey
Yunlong Tang
Jing Bi
Siting Xu
Luchuan Song
Susan Liang
...
Feng Zheng
Jianguo Zhang
Chenliang Xu
Jiebo Luo
VLM
860
202
0
29 Dec 2023