ResearchTrend.AI

Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences
arXiv:2401.10529 · 19 January 2024
Xiyao Wang
Yuhang Zhou
Xiaoyu Liu
Hongjin Lu
Yuancheng Xu
Feihong He
Jaehong Yoon
Taixi Lu
Gedas Bertasius
Mohit Bansal
Huaxiu Yao
Furong Huang
LRM, VLM
ArXiv (abs) · PDF · HTML · HuggingFace (1 upvote) · GitHub (30★)

Papers citing "Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences"

50 / 67 papers shown
VCU-Bridge: Hierarchical Visual Connotation Understanding via Semantic Bridging
Ming Zhong
Y. Wang
Liuzhou Zhang
Arctanx An
Renrui Zhang
Hao Liang
Ming Lu
Ying Shen
Wentao Zhang
144
0
0
22 Nov 2025
Towards Mitigating Hallucinations in Large Vision-Language Models by Refining Textual Embeddings
Aakriti Agrawal
Gouthaman KV
R. Aralikatti
Gauri Jagatap
Jiaxin Yuan
Vijay Kamarshi
Andrea Fanelli
Furong Huang
VLM
116
0
0
07 Nov 2025
NVIDIA Nemotron Nano V2 VL
Nvidia
Amala Sanjay Deshmukh
Kateryna Chumachenko
Tuomas Rintamaki
Matthieu Le
...
Krzysztof Pawelec
Michael Evans
Katherine Luna
Jie Lou
Erick Galinkin
VLM
264
1
0
06 Nov 2025
ROVER: Benchmarking Reciprocal Cross-Modal Reasoning for Omnimodal Generation
Yongyuan Liang
Wei Chow
Feng Li
Ziqiao Ma
Xiyao Wang
Jiageng Mao
Jiuhai Chen
Jiatao Gu
Y. Wang
Furong Huang
LRM
188
1
0
03 Nov 2025
Looking Beyond the Obvious: A Survey on Abstract Concept Recognition for Video Understanding
Gowreesh Mago
Pascal Mettes
Stevan Rudinac
120
0
0
28 Aug 2025
Oedipus and the Sphinx: Benchmarking and Improving Visual Language Models for Complex Graphic Reasoning
Jianyi Zhang
Xu Ji
Ziyin Zhou
Yuchen Zhou
Shubo Shi
Haoyu Wu
Zhen Li
Shizhao Liu
ReLM, CoGe, LRM, VLM
134
1
0
01 Aug 2025
Dual-Stage Value-Guided Inference with Margin-Based Reward Adjustment for Fast and Faithful VLM Captioning
Ankan Deria
Adinath Madhavrao Dukre
Feilong Tang
Sara Atito
Sudipta Roy
Muhammad Awais
Muhammad Haris Khan
Imran Razzak
VLM
223
0
0
18 Jun 2025
Burn After Reading: Do Multimodal Large Language Models Truly Capture Order of Events in Image Sequences?
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Yingjin Song
Yupei Du
Denis Paperno
Albert Gatt
MLLM
217
1
0
12 Jun 2025
Mitigating Behavioral Hallucination in Multimodal Large Language Models for Sequential Images
Liangliang You
Junchi Yao
Shu Yang
Guimin Hu
Lijie Hu
Di Wang
MLLM
219
2
0
08 Jun 2025
Contra4: Evaluating Contrastive Cross-Modal Reasoning in Audio, Video, Image, and 3D
Artemis Panagopoulou
Le Xue
Honglu Zhou
Silvio Savarese
Ran Xu
Caiming Xiong
Chris Callison-Burch
Mark Yatskar
Juan Carlos Niebles
243
0
0
02 Jun 2025
Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks
Yixin Cao
Shibo Hong
Xuzhao Li
Jiahao Ying
Yubo Ma
...
Juanzi Li
Aixin Sun
Qi Zhang
Tat-Seng Chua
Tianwei Zhang
ALM, ELM
480
21
0
26 Apr 2025
RefVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image Generation
Aviv Slobodkin
Hagai Taitelbaum
Yonatan Bitton
Brian Gordon
Michal Sokolik
Nitzan Bitton-Guetta
Almog Gueta
Royi Rassin
Itay Laish
Dani Lischinski
EGVM, VGen
324
1
0
24 Apr 2025
SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement
Xinze Wang
Zhiyong Yang
Chao Feng
Hongjin Lu
Linjie Li
Chung-Ching Lin
Kevin Qinghong Lin
Furong Huang
Lijuan Wang
OODD, ReLM, LRM, VLM
542
68
0
10 Apr 2025
Towards Visual Text Grounding of Multimodal Large Language Model
Ming Li
Ruiyi Zhang
Jian Chen
Jiuxiang Gu
Franck Dernoncourt
Wanrong Zhu
Tianyi Zhou
Tong Sun
383
12
0
07 Apr 2025
Are you really listening? Boosting Perceptual Awareness in Music-QA Benchmarks
Yongyi Zang
Sean O'Brien
Taylor Berg-Kirkpatrick
Julian McAuley
Cheng-i Wang
AuLLM
301
10
0
01 Apr 2025
Aurelia: Test-time Reasoning Distillation in Audio-Visual LLMs
Sanjoy Chowdhury
Hanan Gani
Nishit Anand
Sayan Nag
Ruohan Gao
Mohamed Elhoseiny
Salman Khan
Dinesh Manocha
LRM
392
6
0
29 Mar 2025
Aligning Multimodal LLM with Human Preference: A Survey
Tao Yu
Yujiao Shi
Chaoyou Fu
Junkang Wu
Jinda Lu
...
Qingsong Wen
Zheng Zhang
Yan Huang
Liang Wang
Tieniu Tan
769
12
0
18 Mar 2025
CoSpace: Benchmarking Continuous Space Perception Ability for Vision-Language Models
Computer Vision and Pattern Recognition (CVPR), 2025
Yiqi Zhu
Zihan Wang
Chen Zhang
Ziwei Sun
Yang Liu
CoGe, VLM
209
2
0
18 Mar 2025
Exploring and Evaluating Multimodal Knowledge Reasoning Consistency of Multimodal Large Language Models
Boyu Jia
Junzhe Zhang
Huixuan Zhang
Xiaojun Wan
LRM
187
5
0
03 Mar 2025
MV-MATH: Evaluating Multimodal Math Reasoning in Multi-Visual Contexts
Computer Vision and Pattern Recognition (CVPR), 2025
Peijie Wang
Zhong-Zhi Li
Fei Yin
Xin Yang
Dekang Ran
Cheng-Lin Liu
LRM
478
26
0
28 Feb 2025
M2-omni: Advancing Omni-MLLM for Comprehensive Modality Support with Competitive Performance
Qingpei Guo
Kaiyou Song
Zipeng Feng
Ziping Ma
Qinglong Zhang
...
Yunxiao Sun
Tai-WeiChang
Jingdong Chen
Ming Yang
Jun Zhou
MLLM, VLM
510
12
0
26 Feb 2025
ImageChain: Advancing Sequential Image-to-Text Reasoning in Multimodal Large Language Models
Danae Sánchez Villegas
Ingo Ziegler
Desmond Elliott
LRM
258
4
0
26 Feb 2025
Natural Language Generation from Visual Events: State-of-the-Art and Key Open Questions
Aditya K Surikuchi
Raquel Fernández
Sandro Pezzelle
EGVM
1.0K
0
0
18 Feb 2025
VAQUUM: Are Vague Quantifiers Grounded in Visual Data?
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Hugh Mee Wong
Rick Nouwen
Albert Gatt
400
0
0
17 Feb 2025
MJ-VIDEO: Fine-Grained Benchmarking and Rewarding Video Preferences in Video Generation
Haibo Tong
Zhaoyang Wang
Zhe Chen
Haonian Ji
Shi Qiu
...
Peng Xia
Mingyu Ding
Rafael Rafailov
Chelsea Finn
Huaxiu Yao
EGVM, VGen
579
8
0
03 Feb 2025
MergeME: Model Merging Techniques for Homogeneous and Heterogeneous MoEs
North American Chapter of the Association for Computational Linguistics (NAACL), 2025
Yuhang Zhou
Giannis Karamanolakis
Victor Soto
Anna Rumshisky
Mayank Kulkarni
Furong Huang
Wei Ai
Jianhua Lu
MoMe
457
5
0
03 Feb 2025
HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding
Computer Vision and Pattern Recognition (CVPR), 2024
Chenxin Tao
Shiqian Su
X. Zhu
Chenyu Zhang
Zhe Chen
...
Wenhai Wang
Lewei Lu
Gao Huang
Yu Qiao
Jifeng Dai
MLLM, VLM
443
5
0
20 Dec 2024
PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models
Computer Vision and Pattern Recognition (CVPR), 2024
Chenyu Yang
Xuan Dong
X. Zhu
Weijie Su
Jiahao Wang
H. Tian
Zheyu Chen
Wenhai Wang
Lewei Lu
Jifeng Dai
VLM
192
9
0
12 Dec 2024
Large Language Model Benchmarks in Medical Tasks
Lawrence K. Q. Yan
Ming Li
Yujiao Shi
Cheng Fei
...
Junyu Liu
Xinyuan Song
Riyang Bao
Zekun Jiang
Ziyuan Qin
LM&MA, AI4MH
567
18
0
28 Oct 2024
Scaling Robot Policy Learning via Zero-Shot Labeling with Foundation Models
Conference on Robot Learning (CoRL), 2024
Nils Blank
Moritz Reuss
Marcel Rühle
Ömer Erdinç Yagmurlu
Fabian Wenzel
Oier Mees
Rudolf Lioutikov
LM&Ro, OffRL
252
13
0
23 Oct 2024
Mitigating Object Hallucination via Concentric Causal Attention
Neural Information Processing Systems (NeurIPS), 2024
Yun Xing
Yiheng Li
Ivan Laptev
Shijian Lu
214
38
0
21 Oct 2024
A Survey of Hallucination in Large Visual Language Models
Wei Lan
Wenyi Chen
Qingfeng Chen
Shirui Pan
Huiyu Zhou
Yi-Lun Pan
LRM
299
11
0
20 Oct 2024
Reflexive Guidance: Improving OoDD in Vision-Language Models via Self-Guided Image-Adaptive Concept Generation
Seulbi Lee
J. Kim
Sangheum Hwang
LRM
938
3
0
19 Oct 2024
MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs
Yunqiu Xu
Linchao Zhu
Yi Yang
368
12
0
16 Oct 2024
MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models
International Conference on Learning Representations (ICLR), 2024
Peng Xia
Siwei Han
Shi Qiu
Yiyang Zhou
Zhaoyang Wang
...
Chenhang Cui
Mingyu Ding
Linjie Li
Lijuan Wang
Huaxiu Yao
275
28
0
14 Oct 2024
Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training
Computer Vision and Pattern Recognition (CVPR), 2024
Gen Luo
Xue Yang
Wenhan Dou
Zhaokai Wang
Jifeng Dai
Yu Qiao
Xizhou Zhu
VLM, MLLM
333
64
0
10 Oct 2024
VHELM: A Holistic Evaluation of Vision Language Models
Neural Information Processing Systems (NeurIPS), 2024
Tony Lee
Haoqin Tu
Chi Heem Wong
Wenhao Zheng
Yiyang Zhou
...
Josselin Somerville Roberts
Michihiro Yasunaga
Huaxiu Yao
Cihang Xie
Abigail Z. Jacobs
VLM
273
41
0
09 Oct 2024
LLaVA-Critic: Learning to Evaluate Multimodal Models
Computer Vision and Pattern Recognition (CVPR), 2024
Tianyi Xiong
Xinze Wang
Dong Guo
Qinghao Ye
Haoqi Fan
Quanquan Gu
Heng Huang
Chunyuan Li
MLLM, VLM, LRM
310
91
0
03 Oct 2024
The Labyrinth of Links: Navigating the Associative Maze of Multi-modal LLMs
International Conference on Learning Representations (ICLR), 2024
Hong Li
Nanxi Li
Yuanjie Chen
Jianbin Zhu
Qinlu Guo
Cewu Lu
Yong-Lu Li
MLLM
256
3
0
02 Oct 2024
A Survey on Multimodal Benchmarks: In the Era of Large AI Models
Lin Li
Guikun Chen
Hanrong Shi
Jun Xiao
Long Chen
319
23
0
21 Sep 2024
JourneyBench: A Challenging One-Stop Vision-Language Understanding Benchmark of Generated Images
Neural Information Processing Systems (NeurIPS), 2024
Zhecan Wang
Junzhang Liu
Chia-Wei Tang
Hani Alomari
Anushka Sivakumar
...
Haoxuan You
A. Ishmam
Kai-Wei Chang
Shih-Fu Chang
Chris Thomas
CoGe, VLM
443
5
0
19 Sep 2024
A Survey on Evaluation of Multimodal Large Language Models
Jiaxing Huang
Jingyi Zhang
LM&MA, ELM, LRM
270
40
0
28 Aug 2024
Multi-Stage Balanced Distillation: Addressing Long-Tail Challenges in Sequence-Level Knowledge Distillation
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Yuhang Zhou
Jing Zhu
Paiheng Xu
Xiaoyu Liu
Xiyao Wang
Danai Koutra
Wei Ai
Furong Huang
259
6
0
19 Jun 2024
Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning
Bingchen Zhao
Yongshuo Zong
Letian Zhang
Timothy Hospedales
VLM
255
37
0
18 Jun 2024
VELOCITI: Benchmarking Video-Language Compositional Reasoning with Strict Entailment
Darshana Saravanan
Darshan Singh
Varun Gupta
Zeeshan Khan
Vineet Gandhi
Makarand Tapaswi
CoGe
121
6
0
16 Jun 2024
MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding
Fei Wang
Xingyu Fu
James Y. Huang
Zekun Li
Qin Liu
...
Kai-Wei Chang
Dan Roth
Sheng Zhang
Hoifung Poon
Muhao Chen
VLM
247
103
0
13 Jun 2024
LLAVIDAL: A Large LAnguage VIsion Model for Daily Activities of Living
Rajatsubhra Chakraborty
Arkaprava Sinha
Dominick Reilly
Manish Kumar Govind
Pu Wang
Francois Bremond
Srijan Das
144
2
0
13 Jun 2024
TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation
Weixi Feng
Jiachen Li
Michael Stephen Saxon
Tsu-Jui Fu
Wenhu Chen
William Yang Wang
EGVM, VGen
198
26
0
12 Jun 2024
MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos
Xuehai He
Weixi Feng
Kaizhi Zheng
Yujie Lu
Wanrong Zhu
...
Zhengyuan Yang
Kevin Lin
William Yang Wang
Lijuan Wang
Xin Eric Wang
VGen, LRM
490
33
0
12 Jun 2024
CARES: A Comprehensive Benchmark of Trustworthiness in Medical Vision Language Models
Neural Information Processing Systems (NeurIPS), 2024
Peng Xia
Ze Chen
Juanxi Tian
Yangrui Gong
Ruibo Hou
...
Jimeng Sun
Zongyuan Ge
Gang Li
James Zou
Huaxiu Yao
MU, VLM
236
65
0
10 Jun 2024