arXiv:2311.17005
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
28 November 2023
Kunchang Li
Yali Wang
Yinan He
Yizhuo Li
Yi Wang
Yi Liu
Zun Wang
Jilan Xu
Guo Chen
Ping Luo
Limin Wang
Yu Qiao
VLM
MLLM
Papers citing
"MVBench: A Comprehensive Multi-modal Video Understanding Benchmark"
50 / 312 papers shown
H2VU-Benchmark: A Comprehensive Benchmark for Hierarchical Holistic Video Understanding
Qi Wu
Quanlong Zheng
Yanhao Zhang
Junlin Xie
Jinguo Luo
...
Peng Liu
Qingsong Xie
Ru Zhen
Haonan Lu
Zhenyu Yang
VLM
58
0
0
31 Mar 2025
Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1
Yi Chen
Yuying Ge
Rui Wang
Yixiao Ge
Lu Qiu
Ying Shan
Xihui Liu
ReLM
VLM
OffRL
LRM
52
2
0
31 Mar 2025
OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts
Y. Wang
Y. Wang
Bo Chen
Tong Wu
Dongyan Zhao
Zilong Zheng
VLM
MLLM
55
1
0
29 Mar 2025
Aurelia: Test-time Reasoning Distillation in Audio-Visual LLMs
Sanjoy Chowdhury
Hanan Gani
Nishit Anand
Sayan Nag
Ruohan Gao
Mohamed Elhoseiny
Salman Khan
Dinesh Manocha
LRM
36
0
0
29 Mar 2025
EgoToM: Benchmarking Theory of Mind Reasoning from Egocentric Videos
Yuxuan Li
Vijay Veerabadran
Michael L. Iuzzolino
Brett D. Roads
Asli Celikyilmaz
Karl Ridgeway
EgoV
54
0
0
28 Mar 2025
Evaluating Multimodal Language Models as Visual Assistants for Visually Impaired Users
Antonia Karamolegkou
Malvina Nikandrou
Georgios Pantazopoulos
Danae Sanchez Villegas
Phillip Rust
Ruchira Dhar
Daniel Hershcovich
Anders Søgaard
34
0
0
28 Mar 2025
FaceBench: A Multi-View Multi-Level Facial Attribute VQA Dataset for Benchmarking Face Perception MLLMs
Xiaoqin Wang
Xusen Ma
Xianxu Hou
Meidan Ding
Yudong Li
Junliang Chen
Wenting Chen
Xiaoyang Peng
LinLin Shen
CVBM
71
0
0
27 Mar 2025
Mobile-VideoGPT: Fast and Accurate Video Understanding Language Model
Abdelrahman M. Shaker
Muhammad Maaz
Chenhui Gou
Hamid Rezatofighi
Salman Khan
F. Khan
52
0
0
27 Mar 2025
Video-R1: Reinforcing Video Reasoning in MLLMs
Kaituo Feng
Kaixiong Gong
B. Li
Zonghao Guo
Yibing Wang
Tianshuo Peng
J. Wu
Xiaoying Zhang
Benyou Wang
Xiangyu Yue
AI4TS
SyDa
LRM
46
13
0
27 Mar 2025
On Large Multimodal Models as Open-World Image Classifiers
Alessandro Conti
Massimiliano Mancini
Enrico Fini
Yiming Wang
Paolo Rota
Elisa Ricci
VLM
Presented at ResearchTrend Connect | VLM on 07 May 2025
72
0
0
27 Mar 2025
MAVERIX: Multimodal Audio-Visual Evaluation Reasoning IndeX
Liuyue Xie
George Z. Wei
Avik Kuthiala
Ce Zheng
Ananya Bal
...
Rohan Choudhury
Morteza Ziyadi
Xu Zhang
Hao Yang
László A. Jeni
62
0
0
27 Mar 2025
From Trial to Triumph: Advancing Long Video Understanding via Visual Context Sample Scaling and Self-reward Alignment
Yucheng Suo
Fan Ma
Linchao Zhu
T. Wang
Fengyun Rao
Yi Yang
LRM
72
0
0
26 Mar 2025
Beyond Intermediate States: Explaining Visual Redundancy through Language
Dingchen Yang
Bowen Cao
Anran Zhang
Weibo Gu
Winston Hu
Guang Chen
VLM
79
0
0
26 Mar 2025
Qwen2.5-Omni Technical Report
Jin Xu
Zhifang Guo
Jinzheng He
Hangrui Hu
Ting He
...
K. Dang
Bin Zhang
X. Wang
Yunfei Chu
Junyang Lin
VGen
AuLLM
86
12
0
26 Mar 2025
PAVE: Patching and Adapting Video Large Language Models
Zhuoming Liu
Yiquan Li
Khoi Duc Nguyen
Yiwu Zhong
Yin Li
KELM
LRM
79
0
0
25 Mar 2025
ACVUBench: Audio-Centric Video Understanding Benchmark
Y. Yang
Jimin Zhuang
Guangzhi Sun
Changli Tang
Y. Li
P. Li
Yifan Jiang
W. Li
Z. Ma
Chao Zhang
AuLLM
CoGe
53
0
0
25 Mar 2025
ST-VLM: Kinematic Instruction Tuning for Spatio-Temporal Reasoning in Vision-Language Models
Dohwan Ko
S. Kim
Yumin Suh
Vijay Kumar B.G
Minseo Yoon
Manmohan Chandraker
Hyunwoo J. Kim
LRM
38
0
0
25 Mar 2025
FALCONEye: Finding Answers and Localizing Content in ONE-hour-long videos with multi-modal LLMs
Carlos Plou
Cesar Borja
Ruben Martinez-Cantin
Ana C. Murillo
56
0
0
25 Mar 2025
SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding
Mingze Xu
Mingfei Gao
Shiyu Li
Jiasen Lu
Zhe Gan
Zhengfeng Lai
Meng Cao
Kai Kang
Y. Yang
Afshin Dehghan
51
1
0
24 Mar 2025
LLaVAction: evaluating and training multi-modal large language models for action recognition
Shaokai Ye
Haozhe Qi
Alexander Mathis
Mackenzie W. Mathis
60
1
0
24 Mar 2025
Breaking the Encoder Barrier for Seamless Video-Language Understanding
Handong Li
Yiyuan Zhang
Longteng Guo
Xiangyu Yue
Jing Liu
VLM
72
0
0
24 Mar 2025
CountLLM: Towards Generalizable Repetitive Action Counting via Large Language Model
Ziyu Yao
Xuxin Cheng
Zhiqi Huang
Lei Li
55
0
0
22 Mar 2025
V2P-Bench: Evaluating Video-Language Understanding with Visual Prompts for Better Human-Model Interaction
Yiming Zhao
Y. Zeng
Yukun Qi
Y. Liu
Lin Yen-Chen
Zehui Chen
Xikun Bao
Jie Zhao
Feng Zhao
VLM
53
2
0
22 Mar 2025
GUI-Xplore: Empowering Generalizable GUI Agents with One Exploration
Yuchen Sun
Shanhui Zhao
Tao Yu
Hao Wen
Samith Va
Mengwei Xu
Yuanchun Li
Chongyang Zhang
LLMAG
62
0
0
22 Mar 2025
4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding
Wenxuan Zhu
Bing Li
Cheng Zheng
Jinjie Mai
Jun-Cheng Chen
...
Abdullah Hamdi
Sara Rojas Martinez
Chia-Wen Lin
Mohamed Elhoseiny
Bernard Ghanem
VLM
48
0
0
22 Mar 2025
PVChat: Personalized Video Chat with One-Shot Learning
Yufei Shi
Weilong Yan
Gang Xu
Yumeng Li
Y. Li
Z. Li
Fei Richard Yu
Ming Li
Si Yong Yeo
43
0
0
21 Mar 2025
Plug-and-Play 1.x-Bit KV Cache Quantization for Video Large Language Models
Keda Tao
Haoxuan You
Yang Sui
Can Qin
H. Wang
VLM
MQ
84
0
0
20 Mar 2025
MASH-VLM: Mitigating Action-Scene Hallucination in Video-LLMs through Disentangled Spatial-Temporal Representations
Kyungho Bae
Jinhyung Kim
Sihaeng Lee
Soonyoung Lee
G. Lee
Jinwoo Choi
62
1
0
20 Mar 2025
REVAL: A Comprehension Evaluation on Reliability and Values of Large Vision-Language Models
Jie M. Zhang
Zheng Yuan
Z. Wang
Bei Yan
Sibo Wang
Xiangkui Cao
Zonghui Guo
Shiguang Shan
Xilin Chen
ELM
36
0
0
20 Mar 2025
Neuro Symbolic Knowledge Reasoning for Procedural Video Question Answering
Thanh-Son Nguyen
Hong Yang
Tzeh Yuan Neoh
Hao Zhang
Ee Yeo Keat
Basura Fernando
NAI
54
0
0
19 Mar 2025
FAVOR-Bench: A Comprehensive Benchmark for Fine-Grained Video Motion Understanding
Chongjun Tu
Lin Zhang
Pengtao Chen
Peng Ye
Xianfang Zeng
W. Cheng
Gang Yu
Tao Chen
79
0
0
19 Mar 2025
Impossible Videos
Zechen Bai
Hai Ci
Mike Zheng Shou
EGVM
VGen
70
0
0
18 Mar 2025
Aligning Multimodal LLM with Human Preference: A Survey
Tao Yu
Y. Zhang
Chaoyou Fu
Junkang Wu
Jinda Lu
...
Qingsong Wen
Z. Zhang
Yan Huang
Liang Wang
T. Tan
73
2
0
18 Mar 2025
Identifying and Mitigating Position Bias of Multi-image Vision-Language Models
Xinyu Tian
Shu Zou
Zhaoyuan Yang
Jing Zhang
58
0
0
18 Mar 2025
CoSpace: Benchmarking Continuous Space Perception Ability for Vision-Language Models
Yiqi Zhu
Z. Wang
C. Zhang
Peng Li
Yang Liu
CoGe
VLM
63
0
0
18 Mar 2025
VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning
Y. Liu
Kevin Qinghong Lin
C. Chen
Mike Zheng Shou
LM&Ro
LRM
71
0
0
17 Mar 2025
Omnia de EgoTempo: Benchmarking Temporal Understanding of Multi-Modal LLMs in Egocentric Videos
Chiara Plizzari
A. Tonioni
Yongqin Xian
Achin Kulshrestha
F. Tombari
EgoV
56
0
0
17 Mar 2025
ViSpeak: Visual Instruction Feedback in Streaming Videos
Shenghao Fu
Q. Yang
Yuan-Ming Li
Yi-Xing Peng
Kun-Yu Lin
Xihan Wei
Jian-Fang Hu
Xiaohua Xie
Wei-Shi Zheng
VLM
58
1
0
17 Mar 2025
Efficient Motion-Aware Video MLLM
Zijia Zhao
Yuqi Huo
Tongtian Yue
Longteng Guo
Haoyu Lu
B. Wang
Weipeng Chen
J. Liu
55
0
0
17 Mar 2025
VITED: Video Temporal Evidence Distillation
Yujie Lu
Yale Song
William Yang Wang
Lorenzo Torresani
Tushar Nagarajan
50
0
0
17 Mar 2025
V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning
Zixu Cheng
Jian Hu
Ziquan Liu
Chenyang Si
Wei Li
Shaogang Gong
LRM
68
2
0
14 Mar 2025
FastVID: Dynamic Density Pruning for Fast Video Large Language Models
Leqi Shen
Guoqiang Gong
Tao He
Yifeng Zhang
Pengzhang Liu
Sicheng Zhao
Guiguang Ding
VLM
63
0
0
14 Mar 2025
Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers
Weiming Ren
Wentao Ma
Huan Yang
Cong Wei
Ge Zhang
Wenhu Chen
Mamba
57
3
0
14 Mar 2025
UVE: Are MLLMs Unified Evaluators for AI-Generated Videos?
Yuanxin Liu
Rui Zhu
Shuhuai Ren
Jiacong Wang
Haoyuan Guo
Xu Sun
Lu Jiang
68
1
0
13 Mar 2025
Keyframe-oriented Vision Token Pruning: Enhancing Efficiency of Large Vision Language Models on Long-Form Video Processing
Yudong Liu
Jingwei Sun
Yueqian Lin
Jingyang Zhang
Ming Yin
Qinsi Wang
J. Zhang
H. Li
Y. Chen
VLM
62
2
0
13 Mar 2025
LVAgent: Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents
Boyu Chen
Zhengrong Yue
Siran Chen
Z. Wang
Yang Liu
Peng Li
Y. Wang
VLM
67
0
0
13 Mar 2025
TIME: Temporal-sensitive Multi-dimensional Instruction Tuning and Benchmarking for Video-LLMs
Yunxiao Wang
Meng Liu
Rui Shao
Haoyu Zhang
Bin Wen
Fan Yang
Tingting Gao
Di Zhang
Liqiang Nie
62
1
0
13 Mar 2025
VideoScan: Enabling Efficient Streaming Video Understanding via Frame-level Semantic Carriers
Ruanjun Li
Yuedong Tan
Yuanming Shi
Jiawei Shao
VLM
75
0
0
12 Mar 2025
BIMBA: Selective-Scan Compression for Long-Range Video Question Answering
Md. Mohaiminul Islam
Tushar Nagarajan
Huiyu Wang
Gedas Bertasius
Lorenzo Torresani
58
0
0
12 Mar 2025
Reasoning is All You Need for Video Generalization: A Counterfactual Benchmark with Sub-question Evaluation
Qiji Zhou
Yifan Gong
Guangsheng Bao
Hongjie Qiu
Jinqiang Li
Xiangrong Zhu
Huajian Zhang
Yue Zhang
LRM
44
0
0
12 Mar 2025