1612.00837
Cited By
v1
v2
v3 (latest)
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
2 December 2016
Yash Goyal
Tejas Khot
D. Summers-Stay
Dhruv Batra
Devi Parikh
CoGe
Papers citing
"Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering"
50 / 2,273 papers shown
Title
Reinforcement Learning for Large Model: A Survey
Weijia Wu
Chen Gao
Joya Chen
Kevin Lin
Qingwei Meng
Yiming Zhang
Yuke Qiu
Hong Zhou
Mike Zheng Shou
273
2
0
24 Dec 2025
PAI-Bench: A Comprehensive Benchmark For Physical AI
Fengzhe Zhou
Jiannan Huang
Jialuo Li
Deva Ramanan
Humphrey Shi
VGen
116
0
0
01 Dec 2025
PhyDetEx: Detecting and Explaining the Physical Plausibility of T2V Models
Zeqing Wang
Keze Wang
Lei Zhang
VGen
84
0
0
01 Dec 2025
CauSight: Learning to Supersense for Visual Causal Discovery
Yize Zhang
M. Chen
Sirui Chen
Bo Peng
Y. Zhang
Tianyu Li
Chaochao Lu
CML
ReLM
LRM
121
0
0
01 Dec 2025
Script: Graph-Structured and Query-Conditioned Semantic Token Pruning for Multimodal Large Language Models
Zhongyu Yang
Dannong Xu
Wei Pang
Yingfang Yuan
VLM
108
0
0
01 Dec 2025
See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models
Le Thien Phuc Nguyen
Zhuoran Yu
Samuel Low Yu Hang
Subin An
J. Lee
...
SeungEun Chung
Thanh-Huy Nguyen
JuWan Maeng
Soochahn Lee
Yong Jae Lee
AuLLM
VLM
165
0
0
01 Dec 2025
REM: Evaluating LLM Embodied Spatial Reasoning through Multi-Frame Trajectories
Jacob Thompson
Emiliano Garcia-Lopez
Yonatan Bisk
LRM
88
0
0
30 Nov 2025
Toward Automatic Safe Driving Instruction: A Large-Scale Vision Language Model Approach
Haruki Sakajo
Hiroshi Takato
Hiroshi Tsutsui
Komei Soda
Hidetaka Kamigaito
Taro Watanabe
MLLM
108
0
0
28 Nov 2025
WearVQA: A Visual Question Answering Benchmark for Wearables in Egocentric Authentic Real-world scenarios
Eun Chang
Z. Huang
Yiwei Liao
Sagar Ravi Bhavsar
Amogh Param
...
Babak Damavandi
Rakesh Wanga
Anuj Kumar
Rohit Patel
Xin Luna Dong
32
0
0
27 Nov 2025
ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering
Alberto Compagnoni
Marco Morini
Sara Sarto
Federico Cocchi
Davide Caffagni
Marcella Cornia
Lorenzo Baraldi
Rita Cucchiara
RALM
LRM
163
0
0
27 Nov 2025
Unexplored flaws in multiple-choice VQA evaluations
Fabio Rosenthal
Sebastian Schmidt
Thorsten Graf
Thorsten Bagodonat
Stephan Günnemann
Leo Schwinn
24
0
0
27 Nov 2025
LocateAnything3D: Vision-Language 3D Detection with Chain-of-Sight
Yunze Man
S. S. Wang
Guowen Zhang
Johan Bjorck
Zhiqi Li
Liang-Yan Gui
Jim Fan
Jan Kautz
Yu Wang
Zhiding Yu
113
0
0
25 Nov 2025
Revisiting KRISP: A Lightweight Reproduction and Analysis of Knowledge-Enhanced Vision-Language Models
Souradeep Dutta
Keshav Bulia
Neena S Nair
VLM
115
0
0
25 Nov 2025
Object-Centric Vision Token Pruning for Vision Language Models
Guangyuan Li
R. Zhao
Jinhong Deng
Yanbo Wang
Joni Pajarinen
VLM
136
0
0
25 Nov 2025
Are Large Vision Language Models Truly Grounded in Medical Images? Evidence from Italian Clinical Visual Question Answering
Federico Felizzi
Olivia Riccomi
Michele Ferramola
Francesco Andrea Causio
Manuel Del Medico
...
Antonio Cristiano
Alessia Longo
Luigi De Angelis
Mariapia Vassalli
Marcello Di Pumpo
112
0
0
24 Nov 2025
Perceptual Taxonomy: Evaluating and Guiding Hierarchical Scene Reasoning in Vision-Language Models
Jonathan Lee
Xingrui Wang
Jiawei Peng
Luoxin Ye
Zehan Zheng
...
Wufei Ma
S. Chen
Yu-Cheng Chou
Prakhar Kaushik
Alan Yuille
LRM
67
0
0
24 Nov 2025
Parallel Vision Token Scheduling for Fast and Accurate Multimodal LMMs Inference
Wengyi Zhan
Mingbao Lin
Zhihang Lin
Rongrong Ji
MLLM
VLM
LRM
195
0
0
24 Nov 2025
UniGame: Turning a Unified Multimodal Model Into Its Own Adversary
Zhaolong Su
Wang Lu
Hao Chen
Sharon Li
Jindong Wang
124
0
0
24 Nov 2025
Towards Efficient VLMs: Information-Theoretic Driven Compression via Adaptive Structural Pruning
Zhaoqi Xu
Yingying Zhang
Jian Li
Jianwei Guo
Qiannan Zhu
Hua Huang
VLM
52
0
0
24 Nov 2025
Quantifying Modality Contributions via Disentangling Multimodal Representations
Padegal Amit
Omkar Mahesh Kashyap
Namitha Rayasam
Nidhi Shekhar
Surabhi Narayan
100
0
0
22 Nov 2025
Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models
Mark Endo
Serena Yeung-Levy
LRM
221
0
0
21 Nov 2025
Don't Learn, Ground: A Case for Natural Language Inference with Visual Grounding
Daniil Ignatev
Ayman Santeer
Albert Gatt
Denis Paperno
140
0
0
21 Nov 2025
Attention Guided Alignment in Efficient Vision-Language Models
Shweta Mahajan
Hoang Le
Hyojin Park
Farzad Farhadzadeh
Munawar Hayat
Fatih Porikli
VLM
100
0
0
21 Nov 2025
Solving Spatial Supersensing Without Spatial Supersensing
Vishaal Udandarao
Shyamgopal Karthik
Surabhi S. Nath
Andreas Hochlehnert
Matthias Bethge
Ameya Prabhu
65
0
0
20 Nov 2025
Multimodal Evaluation of Russian-language Architectures
Artem Chervyakov
Ulyana Isaeva
Anton A. Emelyanov
Artem Safin
Maria Tikhonova
...
Ilseyar Alimova
A. Kapitanov
Alena Fenogenova
262
1
0
19 Nov 2025
HinTel-AlignBench: A Framework and Benchmark for Hindi-Telugu with English-Aligned Samples
Rishikant Chigrupaatii
Ponnada Sai Tulasi Kanishka
Lalit Chandra Routhu
Martin Patel Sama Supratheek Reddy
Divyam Gupta
Dasari Srikar
Krishna Teja Kuchimanchi
Rajiv Misra
Rohun Tripathi
137
0
0
19 Nov 2025
Parameter Importance-Driven Continual Learning for Foundation Models
LingXiang Wang
Hainan Zhang
Zhiming Zheng
KELM
CLL
426
0
0
19 Nov 2025
Multimodal Continual Instruction Tuning with Dynamic Gradient Guidance
Songze Li
Mingyu Gao
Tonghua Su
Xu-Yao Zhang
Zhongjie Wang
CLL
268
0
0
19 Nov 2025
SweeperBot: Making 3D Browsing Accessible through View Analysis and Visual Question Answering
International Journal of Human-Computer Interaction (IJHCI), 2025
Chen Chen
Cuong Nguyen
Alexa Siu
Dingzeyu Li
Nadir Weibel
252
0
0
18 Nov 2025
Training-Free Multi-View Extension of IC-Light for Textual Position-Aware Scene Relighting
Jiangnan Ye
Jiedong Zhuang
Lianrui Mu
Wenjie Zheng
Jiaqi Hu
Xingze Zou
Jing Wang
Haoji Hu
3DGS
152
0
0
17 Nov 2025
Explore How to Inject Beneficial Noise in MLLMs
Ruishu Zhu
Sida Huang
Ziheng Jiao
Hongyuan Zhang
168
3
0
17 Nov 2025
CreBench: Human-Aligned Creativity Evaluation from Idea to Process to Product
Kaiwen Xue
Chenglong Li
Zhonghong Ou
Guoxin Zhang
Kaoyan Lu
...
Xinyu Liu
Qunlin Chen
Weiwei Qin
Yiran Shen
Jiayi Cen
96
0
0
17 Nov 2025
CoTBox-TTT: Grounding Medical VQA with Visual Chain-of-Thought Boxes During Test-time Training
Jiahe Qian
Yuhao Shen
Zhangtianyi Chen
Juexiao Zhou
Peisong Wang
132
0
0
16 Nov 2025
OAD-Promoter: Enhancing Zero-shot VQA using Large Language Models with Object Attribute Description
Quanxing Xu
Ling Zhou
Feifei Zhang
Jinyu Tian
Rubing Huang
VLM
184
0
0
15 Nov 2025
TopoPerception: A Shortcut-Free Evaluation of Global Visual Perception in Large Vision-Language Models
Wenhao Zhou
Hao Zheng
R. Zhao
MLLM
VLM
LRM
156
0
0
14 Nov 2025
FaithAct: Faithfulness Planning and Acting in MLLMs
Junxian Li
Xinyue Xu
Sai Ma
Sichao Li
LRM
112
1
0
11 Nov 2025
MULTIBENCH++: A Unified and Comprehensive Multimodal Fusion Benchmarking Across Specialized Domains
Leyan Xue
Zongbo Han
Kecheng Xue
Xiaohong Liu
Guangyu Wang
C. Zhang
112
0
0
09 Nov 2025
Unveiling Modality Bias: Automated Sample-Specific Analysis for Multimodal Misinformation Benchmarks
Hehai Lin
Hui Liu
S. Cao
Jing Li
Haoliang Li
Wenya Wang
196
0
0
08 Nov 2025
Visual Spatial Tuning
Rui Yang
Ziyu Zhu
Yanwei Li
Jingjia Huang
Shen Yan
...
Xiangtai Li
S. Li
Wenqian Wang
Yi Lin
Hengshuang Zhao
VLM
325
5
0
07 Nov 2025
Benchmark Designers Should "Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts
Ellis L Brown
Jihan Yang
Shusheng Yang
Rob Fergus
Saining Xie
VLM
226
5
0
06 Nov 2025
IndicVisionBench: Benchmarking Cultural and Multilingual Understanding in VLMs
Ali Faraz
Akash
Shaharukh Khan
Raja Kolla
Akshat Patidar
Suranjan Goswami
Abhinav Ravi
Chandra Khatri
Shubham Agarwal
VLM
156
0
0
06 Nov 2025
NVIDIA Nemotron Nano V2 VL
Nvidia
Amala Sanjay Deshmukh
Kateryna Chumachenko
Tuomas Rintamaki
Matthieu Le
...
Krzysztof Pawelec
Michael Evans
Katherine Luna
Jie Lou
Erick Galinkin
VLM
288
1
0
06 Nov 2025
SIMS-V: Simulated Instruction-Tuning for Spatial Video Understanding
Ellis L Brown
Arijit Ray
Ranjay Krishna
Ross B. Girshick
Rob Fergus
Saining Xie
301
6
0
06 Nov 2025
CoCoVa: Chain of Continuous Vision-Language Thought for Latent Space Reasoning
Jizheng Ma
Xiaofei Zhou
Yanlong Song
Han Yan
VLM
LRM
153
1
0
04 Nov 2025
When One Modality Sabotages the Others: A Diagnostic Lens on Multimodal Reasoning
Chenyu Zhang
Minsol Kim
Shohreh Ghorbani
Jingyao Wu
Rosalind Picard
Patricia Maes
Paul Pu Liang
118
1
0
04 Nov 2025
ToolScope: An Agentic Framework for Vision-Guided and Long-Horizon Tool Use
Mengjie Deng
Guanting Dong
Zhicheng Dou
100
1
0
31 Oct 2025
MM-OPERA: Benchmarking Open-ended Association Reasoning for Large Vision-Language Models
Zimeng Huang
Jinxin Ke
Xiaoxuan Fan
Yufeng Yang
Yang Liu
...
Junteng Dai
Haoyi Jiang
Y. Zhou
Keze Wang
Z. Chen
LRM
VLM
311
0
0
30 Oct 2025
CRAG-MM: Multi-modal Multi-turn Comprehensive RAG Benchmark
Jiaqi Wang
X. J. Yang
Kai Sun
Parth Suresh
Sanat Sharma
...
Rakesh Wanga
Anuj Kumar
Rohit Patel
Wen-tau Yih
Xin Luna Dong
116
0
0
30 Oct 2025
Self-Improving Vision-Language-Action Models with Data Generation via Residual RL
Wenli Xiao
Haotian Lin
Andy Peng
Haoru Xue
Tairan He
...
Jimmy Wu
Zhengyi Luo
Linxi Fan
Guanya Shi
Yuke Zhu
VLM
470
4
0
30 Oct 2025
What do vision-language models see in the context? Investigating multimodal in-context learning
G. O. D. Santos
Esther Colombini
Sandra Avila
92
0
0
28 Oct 2025