1612.00837
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
2 December 2016
Yash Goyal
Tejas Khot
D. Summers-Stay
Dhruv Batra
Devi Parikh
CoGe
Papers citing
"Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering"
50 / 2,262 papers shown
Title
Accelerating Vision Transformers with Adaptive Patch Sizes
Rohan Choudhury
JungEun Kim
Jeongseok Lee
Eunho Yang
László A. Jeni
Kishore Venkateshan
ViT
84
0
0
20 Oct 2025
FineVision: Open Data Is All You Need
Luis Wiedmann
Orr Zohar
Amir Mahla
Xiaohan Wang
Rui Li
Thibaud Frere
Leandro von Werra
Aritra Roy Gosthipaty
Andrés Marafioti
VLM
160
11
0
20 Oct 2025
OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM
Hanrong Ye
Chao-Han Huck Yang
Arushi Goel
Wei Huang
Ligeng Zhu
...
Andrew Tao
Song Han
Jan Kautz
Hongxu Yin
Pavlo Molchanov
142
3
0
17 Oct 2025
Knowledge-based Visual Question Answer with Multimodal Processing, Retrieval and Filtering
Yuyang Hong
Jiaqi Gu
Qi Yang
Lubin Fan
Yue-bo Wu
Ying Wang
Kun Ding
Shiming Xiang
Jieping Ye
157
2
0
16 Oct 2025
Vision-Centric Activation and Coordination for Multimodal Large Language Models
Yunnan Wang
Fan Lu
Kecheng Zheng
Ziyuan Huang
Ziqiang Li
Wenjun Zeng
Xin Jin
MLLM
260
0
0
16 Oct 2025
From Pixels to Words -- Towards Native Vision-Language Primitives at Scale
Haiwen Diao
Mingxuan Li
Silei Wu
Linjun Dai
Xiaohua Wang
Hanming Deng
Lewei Lu
Dahua Lin
Ziwei Liu
VLM
112
0
0
16 Oct 2025
Train a Unified Multimodal Data Quality Classifier with Synthetic Data
Weizhi Wang
Rongmei Lin
Shiyang Li
Colin Lockard
Ritesh Sarkhel
Sanket Lokegaonkar
Jingbo Shang
Xifeng Yan
Nasser Zalmout
Xian Li
80
0
0
16 Oct 2025
Seeing and Knowing in the Wild: Open-domain Visual Entity Recognition with Large-scale Knowledge Graphs via Contrastive Learning
Hongkuan Zhou
Lavdim Halilaj
Sebastian Monka
Stefan Schmid
Yuqicheng Zhu
Jingcheng Wu
Nadeem Nazer
Steffen Staab
VLM
108
0
0
15 Oct 2025
InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue
Wenwen Tong
Hewei Guo
Dongchuan Ran
Jiangnan Chen
Jiefan Lu
...
Dinghao Zhou
Guiping Zhong
Ken Zheng
Shiyin Kang
Lewei Lu
MLLM
AuLLM
VGen
VLM
380
3
0
15 Oct 2025
End-to-End Multi-Modal Diffusion Mamba
Chunhao Lu
Qiang Lu
Meichen Dong
Jake Luo
102
3
0
15 Oct 2025
NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching
Run Luo
Xiaobo Xia
Lu Wang
Longze Chen
Renke Shan
Jing Luo
Min Yang
Tat-Seng Chua
VGen
196
4
0
15 Oct 2025
VQArt-Bench: A semantically rich VQA Benchmark for Art and Cultural Heritage
A. Alfarano
L. Venturoli
D. Negueruela del Castillo
CoGe
VLM
148
0
0
14 Oct 2025
CompoDistill: Attention Distillation for Compositional Reasoning in Multimodal LLMs
Jiwan Kim
Kibum Kim
Sangwoo Seo
Chanyoung Park
VLM
128
0
0
14 Oct 2025
Human Uncertainty-Aware Data Selection and Automatic Labeling in Visual Question Answering
Jian Lan
Zhicheng Liu
Udo Schlegel
Raoyuan Zhao
Yihong Liu
Hinrich Schütze
Michael A. Hedderich
Thomas Seidl
VLM
103
0
0
13 Oct 2025
UniFlow: A Unified Pixel Flow Tokenizer for Visual Understanding and Generation
Zhengrong Yue
H. Zhang
Xiangyu Zeng
Boyu Chen
Chenting Wang
...
Lu Dong
Kunpeng Du
Yi Wang
Limin Wang
Yali Wang
148
3
0
12 Oct 2025
Towards Self-Refinement of Vision-Language Models with Triangular Consistency
Yunlong Deng
Guangyi Chen
Tianpei Gu
Lingjing Kong
Yan Li
Zeyu Tang
Kun Zhang
128
1
0
12 Oct 2025
CoIDO: Efficient Data Selection for Visual Instruction Tuning via Coupled Importance-Diversity Optimization
Yichen Yan
Ming Zhong
Qi Zhu
Xiaoling Gu
Jinpeng Chen
Huan Li
93
0
0
11 Oct 2025
LM Fight Arena: Benchmarking Large Multimodal Models via Game Competition
Yushuo Zheng
Zicheng Zhang
Xiongkuo Min
Huiyu Duan
Guangtao Zhai
68
1
0
10 Oct 2025
Task-Aware Resolution Optimization for Visual Large Language Models
Weiqing Luo
Zhen Tan
Y. Li
Xinyu Zhao
Kwonjoon Lee
Behzad Dariush
Tianlong Chen
56
0
0
10 Oct 2025
NaViL: Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints
Changyao Tian
Hao Li
Gen Luo
Xizhou Zhu
Weijie Su
...
Y. Liu
Lewei Lu
Wenhai Wang
Hongsheng Li
Jifeng Dai
89
1
0
09 Oct 2025
Looking to Learn: Token-wise Dynamic Gating for Low-Resource Vision-Language Modelling
Bianca-Mihaela Ganescu
Suchir Salhan
Andrew Caines
P. Buttery
VLM
80
0
0
09 Oct 2025
FedBook: A Unified Federated Graph Foundation Codebook with Intra-domain and Inter-domain Knowledge Modeling
Zhengyu Wu
Yinlin Zhu
Xunkai Li
Ziang Qiu
Rong-Hua Li
Guoren Wang
Chenghu Zhou
FedML
81
0
0
09 Oct 2025
To Sink or Not to Sink: Visual Information Pathways in Large Vision-Language Models
Jiayun Luo
Wan-Cyuan Fan
Lyuyang Wang
Xiangteng He
Tanzila Rahman
Purang Abolmaesumi
Leonid Sigal
LRM
116
0
0
09 Oct 2025
Get RICH or Die Scaling: Profitably Trading Inference Compute for Robustness
Tavish McDonald
Bo Lei
Stanislav Fort
B. Kailkhura
Brian Bartoldson
60
0
0
08 Oct 2025
Automated Repeatable Adversary Threat Emulation with Effects Language (EL)
Suresh Damodaran
Paul D. Rowe
AAML
84
8
0
07 Oct 2025
Visual Representations inside the Language Model
Benlin Liu
Amita Kamath
Madeleine Grunde-McLaughlin
Winson Han
Ranjay Krishna
106
2
0
06 Oct 2025
VaseVQA-3D: Benchmarking 3D VLMs on Ancient Greek Pottery
Nonghai Zhang
Zeyu Zhang
Jiazi Wang
Yang Zhao
Hao Tang
CoGe
234
0
0
06 Oct 2025
The Artificial Intelligence Cognitive Examination: A Survey on the Evolution of Multimodal Evaluation from Recognition to Reasoning
Mayank Ravishankara
Varindra V. Persad Maharaj
ELM
133
0
0
05 Oct 2025
Don't Just Chase "Highlighted Tokens" in MLLMs: Revisiting Visual Holistic Context Retention
Xin Zou
Di Lu
Yizhou Wang
Yibo Yan
Yuanhuiyi Lyu
Xu Zheng
Linfeng Zhang
Xuming Hu
VLM
225
5
0
03 Oct 2025
ImageNet-Think-250K: A Large-Scale Synthetic Dataset for Multimodal Reasoning for Vision Language Models
Krishna Teja Chitty-Venkata
M. Emani
MLLM
VGen
LRM
VLM
149
1
0
02 Oct 2025
Mitigating Modal Imbalance in Multimodal Reasoning
Chen Henry Wu
Neil Kale
Aditi Raghunathan
LRM
100
0
0
02 Oct 2025
Data Selection for Fine-tuning Vision Language Models via Cross Modal Alignment Trajectories
Nilay Naharas
Dang Nguyen
Nesihan Bulut
M. Bateni
Vahab Mirrokni
Baharan Mirzasoleiman
92
0
0
01 Oct 2025
Efficient Multi-modal Large Language Models via Progressive Consistency Distillation
Zichen Wen
Shaobo Wang
Yufa Zhou
J. Zhang
Qintong Zhang
...
Zhaorun Chen
Bin Wang
W. Li
Conghui He
Linfeng Zhang
VLM
104
6
0
01 Oct 2025
Human-MME: A Holistic Evaluation Benchmark for Human-Centric Multimodal Large Language Models
Yuansen Liu
Haiming Tang
Jinlong Peng
Jiangning Zhang
Xiaozhong Ji
...
Chaoyou Fu
Chengjie Wang
Xiaobin Hu
Shuicheng Yan
VLM
197
1
0
30 Sep 2025
TAP: Two-Stage Adaptive Personalization of Multi-task and Multi-Modal Foundation Models in Federated Learning
Seohyun Lee
Wenzhi Fang
Dong-Jun Han
Seyyedali Hosseinalipour
Christopher G. Brinton
96
0
0
30 Sep 2025
From Perception to Cognition: A Survey of Vision-Language Interactive Reasoning in Multimodal Large Language Models
Chenyue Zhou
Mingxuan Wang
Yanbiao Ma
Chenxu Wu
Wanyi Chen
...
Guoli Jia
Lingling Li
Z. Lu
Y. Lu
Wenhan Luo
LRM
375
7
0
29 Sep 2025
When MLLMs Meet Compression Distortion: A Coding Paradigm Tailored to MLLMs
Jinming Liu
Zhaoyang Jia
J. Li
Bin Li
Xin Jin
Wenjun Zeng
Yan Lu
64
0
0
29 Sep 2025
VisualOverload: Probing Visual Understanding of VLMs in Really Dense Scenes
Paul Gavrikov
Wei Lin
Muhammad Jehanzeb Mirza
Soumya Jahagirdar
Muhammad Huzaifa
Sivan Doveh
Serena Yeung-Levy
James R. Glass
Hilde Kuehne
CoGe
139
1
0
29 Sep 2025
Training-Free Token Pruning via Zeroth-Order Gradient Estimation in Vision-Language Models
Youngeun Kim
Youjia Zhang
Huiling Liu
Aecheon Jung
Sunwoo Lee
Sungeun Hong
VLM
111
0
0
29 Sep 2025
LUQ: Layerwise Ultra-Low Bit Quantization for Multimodal Large Language Models
Shubhang Bhatnagar
Andy Xu
Kar-Han Tan
Narendra Ahuja
MQ
146
0
0
28 Sep 2025
HiViS: Hiding Visual Tokens from the Drafter for Speculative Decoding in Vision-Language Models
Zhinan Xie
Peisong Wang
Jian Cheng
VLM
70
0
0
28 Sep 2025
Beyond Greedy Exits: Improved Early Exit Decisions for Risk Control and Reliability
Divya J. Bajpai
M. Hanawal
68
0
0
28 Sep 2025
Self-Consistency as a Free Lunch: Reducing Hallucinations in Vision-Language Models via Self-Reflection
Mingfei Han
Haihong Hao
Jinxing Zhou
Zhihui Li
Yuhui Zheng
XueQing Deng
Linjie Yang
Xiaojun Chang
HILM
VLM
104
0
0
27 Sep 2025
Multi-modal Data Spectrum: Multi-modal Datasets are Multi-dimensional
Divyam Madaan
Varshan Muhunthan
Kyunghyun Cho
S. Chopra
73
1
0
27 Sep 2025
Uncovering Intrinsic Capabilities: A Paradigm for Data Curation in Vision-Language Models
Junjie Li
Ziao Wang
Jianghong Ma
Xiaofeng Zhang
96
0
0
27 Sep 2025
AttAnchor: Guiding Cross-Modal Token Alignment in VLMs with Attention Anchors
Junyang Zhang
Tianyi Zhu
Thierry Tambe
44
0
0
27 Sep 2025
Chimera: Diagnosing Shortcut Learning in Visual-Language Understanding
Ziheng Chi
Yifan Hou
Chenxi Pang
Shaobo Cui
Mubashara Akhtar
Mrinmaya Sachan
103
0
0
26 Sep 2025
REMA: A Unified Reasoning Manifold Framework for Interpreting Large Language Model
Bo Li
Guanzhi Deng
Ronghao Chen
Junrong Yue
Shuo Zhang
Qinghua Zhao
Linqi Song
Lijie Wen
LRM
85
0
0
26 Sep 2025
DeFacto: Counterfactual Thinking with Images for Enforcing Evidence-Grounded and Faithful Reasoning
Tianrun Xu
Haoda Jing
Y. Li
Yuquan Wei
Jun Feng
Guanyu Chen
Haichuan Gao
Tianren Zhang
Feng Chen
OffRL
71
0
0
25 Sep 2025
SCRA-VQA: Summarized Caption-Rerank for Augmented Large Language Models in Visual Question Answering
Yan Zhang
Jiaqing Lin
Miao Zhang
Kui Xiao
Xiaoju Hou
Yue Zhao
Ruoyao Xiao
82
0
0
25 Sep 2025