2401.10529
Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences
19 January 2024
Xiyao Wang
Yuhang Zhou
Xiaoyu Liu
Hongjin Lu
Yuancheng Xu
Feihong He
Jaehong Yoon
Taixi Lu
Gedas Bertasius
Mohit Bansal
Huaxiu Yao
Furong Huang
LRM
VLM
ArXiv (abs)
PDF
HTML
HuggingFace (1 upvotes)
Github (30★)
Papers citing
"Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences"
50 / 67 papers shown
Title
VCU-Bridge: Hierarchical Visual Connotation Understanding via Semantic Bridging
Ming Zhong
Y. Wang
Liuzhou Zhang
Arctanx An
Renrui Zhang
Hao Liang
Ming Lu
Ying Shen
Wentao Zhang
144
0
0
22 Nov 2025
Towards Mitigating Hallucinations in Large Vision-Language Models by Refining Textual Embeddings
Aakriti Agrawal
Gouthaman KV
R. Aralikatti
Gauri Jagatap
Jiaxin Yuan
Vijay Kamarshi
Andrea Fanelli
Furong Huang
VLM
116
0
0
07 Nov 2025
NVIDIA Nemotron Nano V2 VL
Nvidia
Amala Sanjay Deshmukh
Kateryna Chumachenko
Tuomas Rintamaki
Matthieu Le
...
Krzysztof Pawelec
Michael Evans
Katherine Luna
Jie Lou
Erick Galinkin
VLM
264
1
0
06 Nov 2025
ROVER: Benchmarking Reciprocal Cross-Modal Reasoning for Omnimodal Generation
Yongyuan Liang
Wei Chow
Feng Li
Ziqiao Ma
Xiyao Wang
Jiageng Mao
Jiuhai Chen
Jiatao Gu
Y. Wang
Furong Huang
LRM
188
1
0
03 Nov 2025
Looking Beyond the Obvious: A Survey on Abstract Concept Recognition for Video Understanding
Gowreesh Mago
Pascal Mettes
Stevan Rudinac
120
0
0
28 Aug 2025
Oedipus and the Sphinx: Benchmarking and Improving Visual Language Models for Complex Graphic Reasoning
Jianyi Zhang
Xu Ji
Ziyin Zhou
Yuchen Zhou
Shubo Shi
Haoyu Wu
Zhen Li
Shizhao Liu
ReLM
CoGe
LRM
VLM
134
1
0
01 Aug 2025
Dual-Stage Value-Guided Inference with Margin-Based Reward Adjustment for Fast and Faithful VLM Captioning
Ankan Deria
Adinath Madhavrao Dukre
Feilong Tang
Sara Atito
Sudipta Roy
Muhammad Awais
Muhammad Haris Khan
Imran Razzak
VLM
223
0
0
18 Jun 2025
Burn After Reading: Do Multimodal Large Language Models Truly Capture Order of Events in Image Sequences?
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Yingjin Song
Yupei Du
Denis Paperno
Albert Gatt
MLLM
217
1
0
12 Jun 2025
Mitigating Behavioral Hallucination in Multimodal Large Language Models for Sequential Images
Liangliang You
Junchi Yao
Shu Yang
Guimin Hu
Lijie Hu
Di Wang
MLLM
219
2
0
08 Jun 2025
Contra4: Evaluating Contrastive Cross-Modal Reasoning in Audio, Video, Image, and 3D
Artemis Panagopoulou
Le Xue
Honglu Zhou
Silvio Savarese
Ran Xu
Caiming Xiong
Chris Callison-Burch
Mark Yatskar
Juan Carlos Niebles
243
0
0
02 Jun 2025
Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks
Yixin Cao
Shibo Hong
Xuzhao Li
Jiahao Ying
Yubo Ma
...
Juanzi Li
Aixin Sun
Qi Zhang
Tat-Seng Chua
Tianwei Zhang
ALM
ELM
480
21
0
26 Apr 2025
RefVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image Generation
Aviv Slobodkin
Hagai Taitelbaum
Yonatan Bitton
Brian Gordon
Michal Sokolik
Nitzan Bitton-Guetta
Almog Gueta
Royi Rassin
Itay Laish
Dani Lischinski
EGVM
VGen
324
1
0
24 Apr 2025
SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement
Xinze Wang
Zhiyong Yang
Chao Feng
Hongjin Lu
Linjie Li
Chung-Ching Lin
Kevin Qinghong Lin
Furong Huang
Lijuan Wang
OODD
ReLM
LRM
VLM
542
68
0
10 Apr 2025
Towards Visual Text Grounding of Multimodal Large Language Model
Ming Li
Ruiyi Zhang
Jian Chen
Jiuxiang Gu
Franck Dernoncourt
Wanrong Zhu
Tianyi Zhou
Tong Sun
383
12
0
07 Apr 2025
Are you really listening? Boosting Perceptual Awareness in Music-QA Benchmarks
Yongyi Zang
Sean O'Brien
Taylor Berg-Kirkpatrick
Julian McAuley
Cheng-i Wang
AuLLM
301
10
0
01 Apr 2025
Aurelia: Test-time Reasoning Distillation in Audio-Visual LLMs
Sanjoy Chowdhury
Hanan Gani
Nishit Anand
Sayan Nag
Ruohan Gao
Mohamed Elhoseiny
Salman Khan
Dinesh Manocha
LRM
392
6
0
29 Mar 2025
Aligning Multimodal LLM with Human Preference: A Survey
Tao Yu
Yujiao Shi
Chaoyou Fu
Junkang Wu
Jinda Lu
...
Qingsong Wen
Zheng Zhang
Yan Huang
Liang Wang
Tieniu Tan
769
12
0
18 Mar 2025
CoSpace: Benchmarking Continuous Space Perception Ability for Vision-Language Models
Computer Vision and Pattern Recognition (CVPR), 2025
Yiqi Zhu
Zihan Wang
Chen Zhang
Ziwei Sun
Yang Liu
CoGe
VLM
209
2
0
18 Mar 2025
Exploring and Evaluating Multimodal Knowledge Reasoning Consistency of Multimodal Large Language Models
Boyu Jia
Junzhe Zhang
Huixuan Zhang
Xiaojun Wan
LRM
187
5
0
03 Mar 2025
MV-MATH: Evaluating Multimodal Math Reasoning in Multi-Visual Contexts
Computer Vision and Pattern Recognition (CVPR), 2025
Peijie Wang
Zhong-Zhi Li
Fei Yin
Xin Yang
Dekang Ran
Cheng-Lin Liu
LRM
478
26
0
28 Feb 2025
M2-omni: Advancing Omni-MLLM for Comprehensive Modality Support with Competitive Performance
Qingpei Guo
Kaiyou Song
Zipeng Feng
Ziping Ma
Qinglong Zhang
...
Yunxiao Sun
Tai-Wei Chang
Jingdong Chen
Ming Yang
Jun Zhou
MLLM
VLM
510
12
0
26 Feb 2025
ImageChain: Advancing Sequential Image-to-Text Reasoning in Multimodal Large Language Models
Danae Sánchez Villegas
Ingo Ziegler
Desmond Elliott
LRM
258
4
0
26 Feb 2025
Natural Language Generation from Visual Events: State-of-the-Art and Key Open Questions
Aditya K Surikuchi
Raquel Fernández
Sandro Pezzelle
EGVM
1.0K
0
0
18 Feb 2025
VAQUUM: Are Vague Quantifiers Grounded in Visual Data?
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Hugh Mee Wong
Rick Nouwen
Albert Gatt
400
0
0
17 Feb 2025
MJ-VIDEO: Fine-Grained Benchmarking and Rewarding Video Preferences in Video Generation
Haibo Tong
Zhaoyang Wang
Zhe Chen
Haonian Ji
Shi Qiu
...
Peng Xia
Mingyu Ding
Rafael Rafailov
Chelsea Finn
Huaxiu Yao
EGVM
VGen
579
8
0
03 Feb 2025
MergeME: Model Merging Techniques for Homogeneous and Heterogeneous MoEs
North American Chapter of the Association for Computational Linguistics (NAACL), 2025
Yuhang Zhou
Giannis Karamanolakis
Victor Soto
Anna Rumshisky
Mayank Kulkarni
Furong Huang
Wei Ai
Jianhua Lu
MoMe
457
5
0
03 Feb 2025
HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding
Computer Vision and Pattern Recognition (CVPR), 2025
Chenxin Tao
Shiqian Su
X. Zhu
Chenyu Zhang
Zhe Chen
...
Wenhai Wang
Lewei Lu
Gao Huang
Yu Qiao
Jifeng Dai
MLLM
VLM
443
5
0
20 Dec 2024
PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models
Computer Vision and Pattern Recognition (CVPR), 2025
Chenyu Yang
Xuan Dong
X. Zhu
Weijie Su
Jiahao Wang
H. Tian
Zheyu Chen
Wenhai Wang
Lewei Lu
Jifeng Dai
VLM
192
9
0
12 Dec 2024
Large Language Model Benchmarks in Medical Tasks
Lawrence K. Q. Yan
Ming Li
Yujiao Shi
Cheng Fei
...
Junyu Liu
Xinyuan Song
Riyang Bao
Zekun Jiang
Ziyuan Qin
LM&MA
AI4MH
567
18
0
28 Oct 2024
Scaling Robot Policy Learning via Zero-Shot Labeling with Foundation Models
Conference on Robot Learning (CoRL), 2024
Nils Blank
Moritz Reuss
Marcel Rühle
Ömer Erdinç Yagmurlu
Fabian Wenzel
Oier Mees
Rudolf Lioutikov
LM&Ro
OffRL
252
13
0
23 Oct 2024
Mitigating Object Hallucination via Concentric Causal Attention
Neural Information Processing Systems (NeurIPS), 2024
Yun Xing
Yiheng Li
Ivan Laptev
Shijian Lu
214
38
0
21 Oct 2024
A Survey of Hallucination in Large Visual Language Models
Wei Lan
Wenyi Chen
Qingfeng Chen
Shirui Pan
Huiyu Zhou
Yi-Lun Pan
LRM
299
11
0
20 Oct 2024
Reflexive Guidance: Improving OoDD in Vision-Language Models via Self-Guided Image-Adaptive Concept Generation
Seulbi Lee
J. Kim
Sangheum Hwang
LRM
938
3
0
19 Oct 2024
MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs
Yunqiu Xu
Linchao Zhu
Yi Yang
368
12
0
16 Oct 2024
MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models
International Conference on Learning Representations (ICLR), 2025
Peng Xia
Siwei Han
Shi Qiu
Yiyang Zhou
Zhaoyang Wang
...
Chenhang Cui
Mingyu Ding
Linjie Li
Lijuan Wang
Huaxiu Yao
275
28
0
14 Oct 2024
Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training
Computer Vision and Pattern Recognition (CVPR), 2025
Gen Luo
Xue Yang
Wenhan Dou
Zhaokai Wang
Jifeng Dai
Yu Qiao
Xizhou Zhu
VLM
MLLM
333
64
0
10 Oct 2024
VHELM: A Holistic Evaluation of Vision Language Models
Neural Information Processing Systems (NeurIPS), 2024
Tony Lee
Haoqin Tu
Chi Heem Wong
Wenhao Zheng
Yiyang Zhou
...
Josselin Somerville Roberts
Michihiro Yasunaga
Huaxiu Yao
Cihang Xie
Abigail Z. Jacobs
VLM
273
41
0
09 Oct 2024
LLaVA-Critic: Learning to Evaluate Multimodal Models
Computer Vision and Pattern Recognition (CVPR), 2025
Tianyi Xiong
Xinze Wang
Dong Guo
Qinghao Ye
Haoqi Fan
Quanquan Gu
Heng Huang
Chunyuan Li
MLLM
VLM
LRM
310
91
0
03 Oct 2024
The Labyrinth of Links: Navigating the Associative Maze of Multi-modal LLMs
International Conference on Learning Representations (ICLR), 2025
Hong Li
Nanxi Li
Yuanjie Chen
Jianbin Zhu
Qinlu Guo
Cewu Lu
Yong-Lu Li
MLLM
256
3
0
02 Oct 2024
A Survey on Multimodal Benchmarks: In the Era of Large AI Models
Lin Li
Guikun Chen
Hanrong Shi
Jun Xiao
Long Chen
319
23
0
21 Sep 2024
JourneyBench: A Challenging One-Stop Vision-Language Understanding Benchmark of Generated Images
Neural Information Processing Systems (NeurIPS), 2024
Zhecan Wang
Junzhang Liu
Chia-Wei Tang
Hani Alomari
Anushka Sivakumar
...
Haoxuan You
A. Ishmam
Kai-Wei Chang
Shih-Fu Chang
Chris Thomas
CoGe
VLM
443
5
0
19 Sep 2024
A Survey on Evaluation of Multimodal Large Language Models
Jiaxing Huang
Jingyi Zhang
LM&MA
ELM
LRM
270
40
0
28 Aug 2024
Multi-Stage Balanced Distillation: Addressing Long-Tail Challenges in Sequence-Level Knowledge Distillation
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Yuhang Zhou
Jing Zhu
Paiheng Xu
Xiaoyu Liu
Xiyao Wang
Danai Koutra
Wei Ai
Furong Huang
259
6
0
19 Jun 2024
Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning
Bingchen Zhao
Yongshuo Zong
Letian Zhang
Timothy Hospedales
VLM
255
37
0
18 Jun 2024
VELOCITI: Benchmarking Video-Language Compositional Reasoning with Strict Entailment
Darshana Saravanan
Darshan Singh
Varun Gupta
Zeeshan Khan
Vineet Gandhi
Makarand Tapaswi
CoGe
121
6
0
16 Jun 2024
MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding
Fei Wang
Xingyu Fu
James Y. Huang
Zekun Li
Qin Liu
...
Kai-Wei Chang
Dan Roth
Sheng Zhang
Hoifung Poon
Muhao Chen
VLM
247
103
0
13 Jun 2024
LLAVIDAL: A Large LAnguage VIsion Model for Daily Activities of Living
Rajatsubhra Chakraborty
Arkaprava Sinha
Dominick Reilly
Manish Kumar Govind
Pu Wang
Francois Bremond
Srijan Das
144
2
0
13 Jun 2024
TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation
Weixi Feng
Jiachen Li
Michael Stephen Saxon
Tsu-Jui Fu
Wenhu Chen
William Yang Wang
EGVM
VGen
198
26
0
12 Jun 2024
MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos
Xuehai He
Weixi Feng
Kaizhi Zheng
Yujie Lu
Wanrong Zhu
...
Zhengyuan Yang
Kevin Lin
William Yang Wang
Lijuan Wang
Xin Eric Wang
VGen
LRM
490
33
0
12 Jun 2024
CARES: A Comprehensive Benchmark of Trustworthiness in Medical Vision Language Models
Neural Information Processing Systems (NeurIPS), 2024
Peng Xia
Ze Chen
Juanxi Tian
Yangrui Gong
Ruibo Hou
...
Jimeng Sun
Zongyuan Ge
Gang Li
James Zou
Huaxiu Yao
MU
VLM
236
65
0
10 Jun 2024