2405.21075
Cited By
v1
v2
v3 (latest)
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
31 May 2024
Chaoyou Fu
Yuhan Dai
Yongdong Luo
Lei Li
Shuhuai Ren
Renrui Zhang
Zihan Wang
Chenyu Zhou
Chunjiang Ge
Mengdan Zhang
Peixian Chen
Yanwei Li
Shaohui Lin
Zhengye Zhang
Ke Li
Tong Xu
Xiawu Zheng
Enhong Chen
Caifeng Shan
Xing Sun
VLM
MLLM
HuggingFace (25 upvotes)
Papers citing
"Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis"
50 / 550 papers shown
WAVE: Learning Unified & Versatile Audio-Visual Embeddings with Multimodal LLM
Changli Tang
Qinfan Xiao
Ke Mei
Tianyi Wang
Fengyun Rao
Chao Zhang
118
0
0
26 Sep 2025
MOSS-ChatV: Reinforcement Learning with Process Reasoning Reward for Video Temporal Reasoning
Sicheng Tao
Jia-Chen Gu
Yibo Yan
Junyan Zhang
Yubo Gao
...
Shuhang Xun
Yuxuan Fan
Hong Chen
Jianxiang He
Xuming Hu
LRM
393
5
0
25 Sep 2025
VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception
Ziang Yan
Xinhao Li
Yinan He
Zhengrong Yue
Xiangyu Zeng
Yali Wang
Yu Qiao
Limin Wang
Yi Wang
MLLM
VLM
LRM
213
13
0
25 Sep 2025
ConViS-Bench: Estimating Video Similarity Through Semantic Concepts
Benedetta Liberatori
Alessandro Conti
Lorenzo Vaquero
Yiming Wang
Elisa Ricci
Paolo Rota
147
1
0
23 Sep 2025
Does Audio Matter for Modern Video-LLMs and Their Benchmarks?
Geewook Kim
Minjoon Seo
AuLLM
117
0
0
22 Sep 2025
UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning
Ye Liu
Zongyang Ma
Junfu Pu
Zhongang Qi
Yang Wu
Mingyu Ding
Chang Wen Chen
MLLM
ObjD
LRM
381
4
0
22 Sep 2025
Qwen3-Omni Technical Report
Jin Xu
Zhifang Guo
Hangrui Hu
Yunfei Chu
Xiong Wang
...
Bowen Yu
Jianxin Yang
Le Yu
Jingren Zhou
Junyang Lin
AuLLM
VGen
VLM
215
60
0
22 Sep 2025
TennisTV: Do Multimodal Large Language Models Understand Tennis Rallies?
Zhongyuan Bao
Lejun Zhang
LRM
259
2
0
19 Sep 2025
ChronoForge-RL: Chronological Forging through Reinforcement Learning for Enhanced Video Understanding
Kehua Chen
VGen
118
1
0
19 Sep 2025
Frame Sampling Strategies Matter: A Benchmark for small vision language models
Marija Brkic
Anas Filali Razzouki
Yannis Tevissen
Khalil Guetari
Mounim A. El Yacoubi
115
0
0
18 Sep 2025
MARS2 2025 Challenge on Multimodal Reasoning: Datasets, Methods, Results, Discussion, and Outlook
Peng Xu
Shengwu Xiong
Jiajun Zhang
Yaxiong Chen
Bowen Zhou
...
Yang Yang
Yanglin Deng
Yashu Kang
Ye Yuan
Y. Wen
LRM
127
1
0
17 Sep 2025
Cinéaste: A Fine-grained Contextual Movie Question Answering Benchmark
Nisarg A. Shah
Amir Ziai
Chaitanya Ekanadham
Vishal M. Patel
VGen
CoGe
ELM
133
0
0
17 Sep 2025
AToken: A Unified Tokenizer for Vision
Jiasen Lu
Liangchen Song
Mingze Xu
Byeongjoo Ahn
Yanjun Wang
Chen Chen
Afshin Dehghan
Yinfei Yang
ViT
249
7
0
17 Sep 2025
Enhancing Video Large Language Models with Structured Multi-Video Collaborative Reasoning
Zhihao He
Tianyao He
Yun Xu
Huabin Liu
Chaofan Gan
Gui Zou
W. Lin
201
2
0
16 Sep 2025
MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe
Tianyu Yu
Zefan Wang
Chongyi Wang
Fuwei Huang
Wenshuo Ma
...
Ning Ding
Xu Han
Xingtai Lv
Zhiyuan Liu
Maosong Sun
MLLM
VLM
198
24
0
16 Sep 2025
Dr.V: A Hierarchical Perception-Temporal-Cognition Framework to Diagnose Video Hallucination by Fine-grained Spatial-Temporal Grounding
Meng Luo
Shengqiong Wu
Liqiang Jing
Tianjie Ju
Li Zheng
...
Jiebo Luo
William Yang Wang
Hao Fei
Yang Deng
Wynne Hsu
169
1
0
15 Sep 2025
FineQuest: Adaptive Knowledge-Assisted Sports Video Understanding via Agent-of-Thoughts Reasoning
Haodong Chen
Haojian Huang
XinXiang Yin
Dian Shao
LRM
175
2
0
15 Sep 2025
DATE: Dynamic Absolute Time Enhancement for Long Video Understanding
Chao Yuan
Y. Yang
Y. Yang
Zach Cheng
99
4
0
11 Sep 2025
Chirality in Action: Time-Aware Video Representation Learning by Latent Straightening
Piyush Bagad
Andrew Zisserman
AI4TS
243
2
0
10 Sep 2025
MESH -- Understanding Videos Like Human: Measuring Hallucinations in Large Video Models
Garry Yang
Zizhe Chen
Man Hon Wong
Haoyu Lei
Yongqiang Chen
Zhenguo Li
Kaiwen Zhou
James Cheng
180
0
0
10 Sep 2025
AdsQA: Towards Advertisement Video Understanding
Xinwei Long
Kai Tian
Peng Xu
Guoli Jia
Jingxuan Li
...
Che Jiang
Hao Xu
Yang Liu
Jiaheng Ma
Bowen Zhou
144
2
0
10 Sep 2025
Video Parallel Scaling: Aggregating Diverse Frame Subsets for VideoLLMs
Hyungjin Chung
Hyelin Nam
J. Kim
Hyojun Go
Byeongjun Park
Junho Kim
J. Lee
Seongsu Ha
Byung-Hoon Kim
142
0
0
09 Sep 2025
In the Eye of MLLM: Benchmarking Egocentric Video Intent Understanding with Gaze-Guided Prompting
Taiying Peng
Jiacheng Hua
Miao Liu
Feng Lu
138
3
0
09 Sep 2025
WildScore: Benchmarking MLLMs in-the-Wild Symbolic Music Reasoning
Gagan Mundada
Yash Vishe
Amit Namburi
Xin Xu
Zachary Novack
Julian McAuley
Junda Wu
LRM
133
4
0
05 Sep 2025
Strefer: Empowering Video LLMs with Space-Time Referring and Reasoning via Synthetic Instruction Data
Honglu Zhou
Xiangyu Peng
Shrikant B. Kendre
Michael S Ryoo
Silvio Savarese
Caiming Xiong
Juan Carlos Niebles
120
1
0
03 Sep 2025
ProMQA-Assembly: Multimodal Procedural QA Dataset on Assembly
Kimihiro Hasegawa
Wiradee Imrattanatrai
Masaki Asada
Susan Holm
Yuran Wang
Vincent Zhou
Ken Fukuda
Teruko Mitamura
163
2
0
03 Sep 2025
Easier Painting Than Thinking: Can Text-to-Image Models Set the Stage, but Not Direct the Play?
Ouxiang Li
Yuan Wang
Xinting Hu
Huijuan Huang
Rui Chen
Jiarong Ou
Xin Tao
Pengfei Wan
Xiaojuan Qi
Fuli Feng
EGVM
CoGe
LRM
322
6
0
03 Sep 2025
VLMs-in-the-Wild: Bridging the Gap Between Academic Benchmarks and Enterprise Reality
Srihari Bandraupalli
Anupam Purwar
VLM
72
1
0
03 Sep 2025
RSCC: A Large-Scale Remote Sensing Change Caption Dataset for Disaster Events
Z. Chen
Chenxi Wang
Ningyu Zhang
Feng Zhang
259
2
0
02 Sep 2025
PixFoundation 2.0: Do Video Multi-Modal LLMs Use Motion in Visual Grounding?
Mennatullah Siam
VGen
118
0
0
02 Sep 2025
Robix: A Unified Model for Robot Interaction, Reasoning and Planning
Huang Fang
Mengxi Zhang
Heng Dong
Wei Li
Z. Wang
Qifeng Zhang
Xueyun Tian
Yucheng Hu
Hang Li
LM&Ro
LRM
168
7
0
01 Sep 2025
Kwai Keye-VL 1.5 Technical Report
Biao Yang
Bin Wen
Boyang Ding
Changyi Liu
Chenglong Chu
...
S. Wang
X. Luo
Yan Li
Yuhang Hu
Zixing Zhang
VLM
333
17
0
01 Sep 2025
Variation-aware Vision Token Dropping for Faster Large Vision-Language Models
Junjie Chen
Xuyang Liu
Zichen Wen
Yiyu Wang
Siteng Huang
Honggang Chen
MQ
VLM
89
5
0
01 Sep 2025
Seeing More, Saying More: Lightweight Language Experts are Dynamic Video Token Compressors
Xiangchen Wang
Jinrui Zhang
Teng Wang
Haigang Zhang
Feng Zheng
142
0
0
31 Aug 2025
LightVLM: Accelerating Large Multimodal Models with Pyramid Token Merging and KV Cache Compression
Lianyu Hu
Fanhua Shang
Wei Feng
Liang Wan
MLLM
VLM
135
0
0
30 Aug 2025
VideoRewardBench: Comprehensive Evaluation of Multimodal Reward Models for Video Understanding
Zhihong Zhang
Xiaojian Huang
Jin Xu
Zhuodong Luo
Xinzhi Wang
Jiansheng Wei
Xuejin Chen
VLM
119
1
0
30 Aug 2025
ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding
Hao Lu
Jiahao Wang
Y. Zhang
Ruohui Wang
Xuanyu Zheng
Yepeng Tang
Dahua Lin
Lewei Lu
VLM
123
0
0
29 Aug 2025
Video-MTR: Reinforced Multi-Turn Reasoning for Long Video Understanding
Yuan Xie
Tianshui Chen
Zheng Ge
L. Ni
LRM
109
9
0
28 Aug 2025
Video-LevelGauge: Investigating Contextual Positional Bias in Large Video Language Models
Hou Xia
Zheren Fu
Fangcan Ling
Jiajun Li
Yi Tu
Zhendong Mao
Yongdong Zhang
192
0
0
27 Aug 2025
CVBench: Benchmarking Cross-Video Synergies for Complex Multimodal Reasoning
Nannan Zhu
Yonghao Dong
T. Wang
Xueqian Li
Shengjun Deng
...
Tiantian Geng
Guo Niu
Hanyan Huang
Xiongfei Yao
Shuaiwei Jiao
LRM
247
3
0
27 Aug 2025
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
Weiyun Wang
Zhangwei Gao
Lixin Gu
Hengjun Pu
Long Cui
...
Bowen Zhou
Kai Chen
Yu Qiao
Wenhai Wang
Gen Luo
MLLM
LRM
305
279
0
25 Aug 2025
Language-Guided Temporal Token Pruning for Efficient VideoLLM Processing
Yogesh Kumar
VLM
104
0
0
25 Aug 2025
AVAM: Universal Training-free Adaptive Visual Anchoring Embedded into Multimodal Large Language Model for Multi-image Question Answering
Kang Zeng
Guojin Zhong
Jintao Cheng
Jin Yuan
Zhiyong Li
140
0
0
25 Aug 2025
Murakkab: Resource-Efficient Agentic Workflow Orchestration in Cloud Platforms
G. Chaudhry
Esha Choukse
Haoran Qiu
Íñigo Goiri
Rodrigo Fonseca
Adam Belay
Ricardo Bianchini
128
2
0
22 Aug 2025
An Empirical Study on How Video-LLMs Answer Video Questions
Chenhui Gou
Ziyu Ma
Zicheng Duan
Haoyu He
Feng Chen
Akide Liu
Bohan Zhuang
Jianfei Cai
H. Rezatofighi
152
1
0
21 Aug 2025
StreamMem: Query-Agnostic KV Cache Memory for Streaming Video Understanding
Yanlai Yang
Zhuokai Zhao
Satya Narayan Shukla
Aashu Singh
Shlok Kumar Mishra
Lizhu Zhang
Mengye Ren
VLM
124
7
0
21 Aug 2025
HumanPCR: Probing MLLM Capabilities in Diverse Human-Centric Scenes
Keliang Li
Hongze Shen
Hao Shi
Ruibing Hou
Hong Chang
...
Wen Wang
Yiling Wu
Shihong Deng
Shiguang Shan
Xilin Chen
LRM
182
1
0
19 Aug 2025
Mitigating Easy Option Bias in Multiple-Choice Question Answering
Hao Zhang
Chen Li
Basura Fernando
AAML
132
0
0
19 Aug 2025
EGOILLUSION: Benchmarking Hallucinations in Egocentric Video Understanding
Ashish Seth
Utkarsh Tyagi
Ramaneswaran Selvakumar
Nishit Anand
Sonal Kumar
Sreyan Ghosh
R. Duraiswami
Chirag Agarwal
Dinesh Manocha
MLLM
HILM
VLM
229
1
0
18 Aug 2025
Ovis2.5 Technical Report
Shiyin Lu
Yan Zhao
Yu Xia
Yuwei Hu
Shanshan Zhao
...
Yuhui Chen
Qing-Guo Chen
Zhao Xu
Weihua Luo
Kaifu Zhang
VLM
LRM
156
33
0
15 Aug 2025