1612.00837
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
2 December 2016
Yash Goyal
Tejas Khot
D. Summers-Stay
Dhruv Batra
Devi Parikh
CoGe
Papers citing
"Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering"
50 / 2,262 papers shown
LocateAnything3D: Vision-Language 3D Detection with Chain-of-Sight
Yunze Man
S. S. Wang
Guowen Zhang
Johan Bjorck
Zhiqi Li
Liang-Yan Gui
Jim Fan
Jan Kautz
Yu Wang
Zhiding Yu
81
0
0
25 Nov 2025
Revisiting KRISP: A Lightweight Reproduction and Analysis of Knowledge-Enhanced Vision-Language Models
Souradeep Dutta
Keshav Bulia
Neena S Nair
VLM
91
0
0
25 Nov 2025
Object-Centric Vision Token Pruning for Vision Language Models
Guangyuan Li
R. Zhao
Jinhong Deng
Yanbo Wang
Joni Pajarinen
VLM
116
0
0
25 Nov 2025
UniGame: Turning a Unified Multimodal Model Into Its Own Adversary
Zhaolong Su
Wang Lu
Hao Chen
Sharon Li
Jindong Wang
88
0
0
24 Nov 2025
Parallel Vision Token Scheduling for Fast and Accurate Multimodal LMMs Inference
Wengyi Zhan
Mingbao Lin
Zhihang Lin
Rongrong Ji
MLLM
VLM
LRM
171
0
0
24 Nov 2025
Towards Efficient VLMs: Information-Theoretic Driven Compression via Adaptive Structural Pruning
Zhaoqi Xu
Yingying Zhang
Jian Li
Jianwei Guo
Qiannan Zhu
Hua Huang
VLM
40
0
0
24 Nov 2025
Perceptual Taxonomy: Evaluating and Guiding Hierarchical Scene Reasoning in Vision-Language Models
Jonathan Lee
Xingrui Wang
Jiawei Peng
Luoxin Ye
Zehan Zheng
...
Wufei Ma
S. Chen
Yu-Cheng Chou
Prakhar Kaushik
Alan Yuille
LRM
51
0
0
24 Nov 2025
Quantifying Modality Contributions via Disentangling Multimodal Representations
Padegal Amit
Omkar Mahesh Kashyap
Namitha Rayasam
Nidhi Shekhar
Surabhi Narayan
72
0
0
22 Nov 2025
Attention Guided Alignment in Efficient Vision-Language Models
Shweta Mahajan
Hoang Le
Hyojin Park
Farzad Farhadzadeh
Munawar Hayat
Fatih Porikli
VLM
80
0
0
21 Nov 2025
Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models
Mark Endo
Serena Yeung-Levy
LRM
173
0
0
21 Nov 2025
Don't Learn, Ground: A Case for Natural Language Inference with Visual Grounding
Daniil Ignatev
Ayman Santeer
Albert Gatt
Denis Paperno
84
0
0
21 Nov 2025
Solving Spatial Supersensing Without Spatial Supersensing
Vishaal Udandarao
Shyamgopal Karthik
Surabhi S. Nath
Andreas Hochlehnert
Matthias Bethge
Ameya Prabhu
45
0
0
20 Nov 2025
Multimodal Evaluation of Russian-language Architectures
Artem Chervyakov
Ulyana Isaeva
Anton A. Emelyanov
Artem Safin
Maria Tikhonova
...
Ilseyar Alimova
A. Kapitanov
Alena Fenogenova
206
1
0
19 Nov 2025
Parameter Importance-Driven Continual Learning for Foundation Models
LingXiang Wang
Hainan Zhang
Zhiming Zheng
KELM
CLL
362
0
0
19 Nov 2025
HinTel-AlignBench: A Framework and Benchmark for Hindi-Telugu with English-Aligned Samples
Rishikant Chigrupaatii
Ponnada Sai Tulasi Kanishka
Lalit Chandra Routhu
Martin Patel Sama Supratheek Reddy
Divyam Gupta
Dasari Srikar
Krishna Teja Kuchimanchi
Rajiv Misra
Rohun Tripathi
129
0
0
19 Nov 2025
Multimodal Continual Instruction Tuning with Dynamic Gradient Guidance
Songze Li
Mingyu Gao
Tonghua Su
Xu-Yao Zhang
Zhongjie Wang
CLL
256
0
0
19 Nov 2025
SweeperBot: Making 3D Browsing Accessible through View Analysis and Visual Question Answering
Chen Chen
Cuong Nguyen
Alexa Siu
Dingzeyu Li
Nadir Weibel
204
0
0
18 Nov 2025
Explore How to Inject Beneficial Noise in MLLMs
Ruishu Zhu
Sida Huang
Ziheng Jiao
Hongyuan Zhang
96
3
0
17 Nov 2025
CreBench: Human-Aligned Creativity Evaluation from Idea to Process to Product
Kaiwen Xue
Chenglong Li
Zhonghong Ou
Guoxin Zhang
Kaoyan Lu
...
Xinyu Liu
Qunlin Chen
Weiwei Qin
Yiran Shen
Jiayi Cen
80
0
0
17 Nov 2025
Training-Free Multi-View Extension of IC-Light for Textual Position-Aware Scene Relighting
Jiangnan Ye
Jiedong Zhuang
Lianrui Mu
Wenjie Zheng
Jiaqi Hu
Xingze Zou
Jing Wang
Haoji Hu
3DGS
124
0
0
17 Nov 2025
CoTBox-TTT: Grounding Medical VQA with Visual Chain-of-Thought Boxes During Test-time Training
Jiahe Qian
Yuhao Shen
Zhangtianyi Chen
Juexiao Zhou
Peisong Wang
104
0
0
16 Nov 2025
OAD-Promoter: Enhancing Zero-shot VQA using Large Language Models with Object Attribute Description
Quanxing Xu
Ling Zhou
Feifei Zhang
Jinyu Tian
Rubing Huang
VLM
120
0
0
15 Nov 2025
TopoPerception: A Shortcut-Free Evaluation of Global Visual Perception in Large Vision-Language Models
Wenhao Zhou
Hao Zheng
R. Zhao
MLLM
VLM
LRM
140
0
0
14 Nov 2025
FaithAct: Faithfulness Planning and Acting in MLLMs
Junxian Li
Xinyue Xu
Sai Ma
Sichao Li
LRM
84
1
0
11 Nov 2025
MULTIBENCH++: A Unified and Comprehensive Multimodal Fusion Benchmarking Across Specialized Domains
Leyan Xue
Zongbo Han
Kecheng Xue
Xiaohong Liu
Guangyu Wang
C. Zhang
100
0
0
09 Nov 2025
Unveiling Modality Bias: Automated Sample-Specific Analysis for Multimodal Misinformation Benchmarks
Hehai Lin
Hui Liu
S. Cao
Jing Li
Haoliang Li
Wenya Wang
172
0
0
08 Nov 2025
Visual Spatial Tuning
Rui Yang
Ziyu Zhu
Yanwei Li
Jingjia Huang
Shen Yan
...
Xiangtai Li
S. Li
Wenqian Wang
Yi Lin
Hengshuang Zhao
VLM
297
4
0
07 Nov 2025
SIMS-V: Simulated Instruction-Tuning for Spatial Video Understanding
Ellis L Brown
Arijit Ray
Ranjay Krishna
Ross B. Girshick
Rob Fergus
Saining Xie
261
6
0
06 Nov 2025
IndicVisionBench: Benchmarking Cultural and Multilingual Understanding in VLMs
Ali Faraz
Akash
Shaharukh Khan
Raja Kolla
Akshat Patidar
Suranjan Goswami
Abhinav Ravi
Chandra Khatri
Shubham Agarwal
VLM
124
0
0
06 Nov 2025
Benchmark Designers Should "Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts
Ellis L Brown
Jihan Yang
Shusheng Yang
Rob Fergus
Saining Xie
VLM
202
4
0
06 Nov 2025
NVIDIA Nemotron Nano V2 VL
Nvidia
Amala Sanjay Deshmukh
Kateryna Chumachenko
Tuomas Rintamaki
Matthieu Le
...
Krzysztof Pawelec
Michael Evans
Katherine Luna
Jie Lou
Erick Galinkin
VLM
248
1
0
06 Nov 2025
When One Modality Sabotages the Others: A Diagnostic Lens on Multimodal Reasoning
Chenyu Zhang
Minsol Kim
Shohreh Ghorbani
Jingyao Wu
Rosalind Picard
Patricia Maes
Paul Pu Liang
90
1
0
04 Nov 2025
CoCoVa: Chain of Continuous Vision-Language Thought for Latent Space Reasoning
Jizheng Ma
Xiaofei Zhou
Yanlong Song
Han Yan
VLM
LRM
137
0
0
04 Nov 2025
ToolScope: An Agentic Framework for Vision-Guided and Long-Horizon Tool Use
Mengjie Deng
Guanting Dong
Zhicheng Dou
72
1
0
31 Oct 2025
Self-Improving Vision-Language-Action Models with Data Generation via Residual RL
Wenli Xiao
Haotian Lin
Andy Peng
Haoru Xue
Tairan He
...
Jimmy Wu
Zhengyi Luo
Linxi Fan
Guanya Shi
Yuke Zhu
VLM
378
3
0
30 Oct 2025
CRAG-MM: Multi-modal Multi-turn Comprehensive RAG Benchmark
Jiaqi Wang
X. J. Yang
Kai Sun
Parth Suresh
Sanat Sharma
...
Rakesh Wanga
Anuj Kumar
Rohit Patel
Wen-tau Yih
Xin Luna Dong
104
0
0
30 Oct 2025
MM-OPERA: Benchmarking Open-ended Association Reasoning for Large Vision-Language Models
Zimeng Huang
Jinxin Ke
Xiaoxuan Fan
Yufeng Yang
Yang Liu
...
Junteng Dai
Haoyi Jiang
Y. Zhou
Keze Wang
Z. Chen
LRM
VLM
267
0
0
30 Oct 2025
What do vision-language models see in the context? Investigating multimodal in-context learning
G. O. D. Santos
Esther Colombini
Sandra Avila
68
0
0
28 Oct 2025
MergeMix: A Unified Augmentation Paradigm for Visual and Multi-Modal Understanding
Xin Jin
Siyuan Li
Siyong Jian
Kai Yu
Huan Wang
96
0
0
27 Oct 2025
A Video Is Not Worth a Thousand Words
Sam Pollard
Michael Wray
72
0
0
27 Oct 2025
FastVLM: Self-Speculative Decoding for Fast Vision-Language Model Inference
Divya J. Bajpai
M. Hanawal
MLLM
VLM
182
0
0
26 Oct 2025
OFFSIDE: Benchmarking Unlearning Misinformation in Multimodal Large Language Models
Hao Zheng
Zirui Pang
Ling Li
Zhijie Deng
Yuhan Pu
Zhaowei Zhu
Xiaobo Xia
Jiaheng Wei
MU
146
0
0
26 Oct 2025
PhysVLM-AVR: Active Visual Reasoning for Multimodal Large Language Models in Physical Environments
Weijie Zhou
Xuantang Xiong
Yi Peng
Manli Tao
Chaoyang Zhao
Honghui Dong
Ming Tang
Jinqiao Wang
LRM
85
0
0
24 Oct 2025
KBE-DME: Dynamic Multimodal Evaluation via Knowledge Enhanced Benchmark Evolution
Junzhe Zhang
Huixuan Zhang
Xiaojun Wan
45
0
0
24 Oct 2025
HyperET: Efficient Training in Hyperbolic Space for Multi-modal Large Language Models
Zelin Peng
Zhengqin Xu
Qingyang Liu
Xiaokang Yang
Wei Shen
129
0
0
23 Oct 2025
I Spy With My Model's Eye: Visual Search as a Behavioural Test for MLLMs
John Burden
Jonathan Prunty
Ben Slater
Matthieu Tehenan
Greg Davis
Lucy G. Cheke
LRM
108
0
0
22 Oct 2025
Unified Reinforcement and Imitation Learning for Vision-Language Models
Byung-Kwan Lee
Ryo Hachiuma
Yong Man Ro
Yu-Chun Wang
Yueh-Hua Wu
VLM
112
1
0
22 Oct 2025
ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder
Xiaoxing Hu
Kaicheng Yang
Ziyang Gong
Qi Ming
Zonghao Guo
Xiang An
Ziyong Feng
Junchi Yan
Xue Yang
CLIP
VLM
171
0
0
21 Oct 2025
DeepSeek-OCR: Contexts Optical Compression
Haoran Wei
Yaofeng Sun
Yukun Li
VLM
124
17
0
21 Oct 2025
UWBench: A Comprehensive Vision-Language Benchmark for Underwater Understanding
Da Zhang
Chenggang Rong
Bingyu Li
Feiyu Wang
Zhiyuan Zhao
Junyu Gao
Xuelong Li
VLM
CoGe
155
0
0
21 Oct 2025