Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering

2 December 2016
Yash Goyal
Tejas Khot
D. Summers-Stay
Dhruv Batra
Devi Parikh
    CoGe

Papers citing "Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering"

50 / 2,277 papers shown
From Global to Local: Social Bias Transfer in CLIP
Ryan Ramos
Yusuke Hirota
Yuta Nakashima
Noa Garcia
112
0
0
25 Aug 2025
VISA: Group-wise Visual Token Selection and Aggregation via Graph Summarization for Efficient MLLMs Inference
Pengfei Jiang
Hanjun Li
Linglan Zhao
Fei Chao
Ke Yan
Shouhong Ding
Rongrong Ji
116
2
0
25 Aug 2025
MMTok: Multimodal Coverage Maximization for Efficient Inference of VLMs
Sixun Dong
Juhua Hu
Mian Zhang
Ming Yin
Yanjie Fu
Qi Qian
104
4
0
25 Aug 2025
Explain Before You Answer: A Survey on Compositional Visual Reasoning
Fucai Ke
Joy Hsu
Zhixi Cai
Zixian Ma
Xin Zheng
...
P. D. Haghighi
Gholamreza Haffari
Ranjay Krishna
Jiajun Wu
H. Rezatofighi
ReLM, CoGe, LRM
344
8
0
24 Aug 2025
PlantVillageVQA: A Visual Question Answering Dataset for Benchmarking Vision-Language Models in Plant Science
Syed Nazmus Sakib
Nafiul Haque
Mohammad Zabed Hossain
Shifat E. Arman
124
1
0
23 Aug 2025
Can VLMs Recall Factual Associations From Visual References?
Dhananjay Ashok
Ashutosh Chaubey
Hirona J. Arai
Jonathan May
Jesse Thomason
76
0
0
22 Aug 2025
Unveiling Trust in Multimodal Large Language Models: Evaluation, Analysis, and Mitigation
Yichi Zhang
Yao Huang
Yifan Wang
Yitong Sun
Chang-rui Liu
...
Xiao Yang
Xingxing Wei
Hang Su
Yinpeng Dong
Jun Zhu
158
1
0
21 Aug 2025
GRAFT: GRaPH and Table Reasoning for Textual Alignment -- A Benchmark for Structured Instruction Following and Visual Reasoning
Abhigya Verma
Sriram Puttagunta
Seganrasan Subramanian
Sravan Ramachandran
96
1
0
21 Aug 2025
Mitigating Easy Option Bias in Multiple-Choice Question Answering
Hao Zhang
Chen Li
Basura Fernando
AAML
124
0
0
19 Aug 2025
Directed-Tokens: A Robust Multi-Modality Alignment Approach to Large Language-Vision Models
Thanh-Dat Truong
Huu-Thien Tran
Tran Thai Son
Bhiksha Raj
Khoa Luu
282
1
0
19 Aug 2025
Multimodal Chain of Continuous Thought for Latent-Space Reasoning in Vision-Language Models
Tan-Hanh Pham
Chris Ngo
LRM
157
3
0
18 Aug 2025
RadarQA: Multi-modal Quality Analysis of Weather Radar Forecasts
Xuming He
Zhiyuan You
Junchao Gong
Couhua Liu
Xiaoyu Yue
Peiqin Zhuang
Wenlong Zhang
Wenlong Zhang
88
3
0
17 Aug 2025
Dataset Creation for Visual Entailment using Generative AI
Rob Reijtenbach
Suzan Verberne
Gijs Wijnholds
SyDa, CoGe
101
0
0
15 Aug 2025
STRIDE-QA: Visual Question Answering Dataset for Spatiotemporal Reasoning in Urban Driving Scenes
Keishi Ishihara
Kento Sasaki
Tsubasa Takahashi
Daiki Shiono
Yu Yamaguchi
85
2
0
14 Aug 2025
Empowering Multimodal LLMs with External Tools: A Comprehensive Survey
Wenbin An
Jiahao Nie
Yaqiang Wu
Feng Tian
Shijian Lu
Q. Zheng
MLLM
170
1
0
14 Aug 2025
ORBIT: An Object Property Reasoning Benchmark for Visual Inference Tasks
Abhishek Kolari
Mohammadhossein Khojasteh
Yifan Jiang
Floris den Hengst
Filip Ilievski
OCL
169
0
0
14 Aug 2025
JRDB-Reasoning: A Difficulty-Graded Benchmark for Visual Reasoning in Robotics
Simindokht Jahangard
Mehrzad Mohammadi
Yi Shen
Zhixi Cai
Hamid Rezatofighi
277
1
0
14 Aug 2025
Reasoning in Computer Vision: Taxonomy, Models, Tasks, and Methodologies
Ayushman Sarkar
Mohd Yamani Idna Idris
Zhenyu Yu
LRM
160
10
0
14 Aug 2025
3DFroMLLM: 3D Prototype Generation only from Pretrained Multimodal LLMs
Noor Ahmed
Cameron Braunstein
Steffen Eger
Eddy Ilg
92
1
0
12 Aug 2025
MolmoAct: Action Reasoning Models that can Reason in Space
Jason Lee
Jiafei Duan
Haoquan Fang
Yuquan Deng
Shuo Liu
...
Karen Farley
Eli VanderBilt
Ali Farhadi
Dieter Fox
Ranjay Krishna
LM&Ro, LRM
413
48
0
11 Aug 2025
CATP: Contextually Adaptive Token Pruning for Efficient and Enhanced Multimodal In-Context Learning
Yanshu Li
JianJiang Yang
Zhennan Shen
Ligong Han
Haoyan Xu
Ruixiang Tang
VLM
309
7
0
11 Aug 2025
Segmenting and Understanding: Region-aware Semantic Attention for Fine-grained Image Quality Assessment with Large Language Models
Chenyue Song
C. Hui
Haiqi Zhu
Feng Jiang
Yachun Mi
Wei Zhang
Shaohui Liu
96
2
0
11 Aug 2025
AURA: A Fine-Grained Benchmark and Decomposed Metric for Audio-Visual Reasoning
Siminfar Samakoush Galougah
Rishie Raj
Sanjoy Chowdhury
Sayan Nag
Ramani Duraiswami
181
3
0
10 Aug 2025
MCITlib: Multimodal Continual Instruction Tuning Library and Benchmark
Haiyang Guo
Fei Zhu
Hongbo Zhao
Fanhu Zeng
Wenzhuo Liu
Shijie Ma
Da-Han Wang
Xu-Yao Zhang
CLL
198
2
0
10 Aug 2025
BASIC: Boosting Visual Alignment with Intrinsic Refined Embeddings in Multimodal Large Language Models
Jianting Tang
Yubo Wang
Haoyu Cao
Linli Xu
88
0
0
09 Aug 2025
Fourier-VLM: Compressing Vision Tokens in the Frequency Domain for Large Vision-Language Models
Huanyu Wang
Jushi Kai
Haoli Bai
Lu Hou
Bo Jiang
Ziwei He
Zhouhan Lin
VLM
94
0
0
08 Aug 2025
CountQA: How Well Do MLLMs Count in the Wild?
Jayant Sravan Tamarapalli
Rynaa Grover
Nilay Pande
Sahiti Yerramilli
148
6
0
08 Aug 2025
SIFThinker: Spatially-Aware Image Focus for Visual Reasoning
Zhangquan Chen
Ruihui Zhao
Chuwei Luo
Mingze Sun
Xinlei Yu
Yangyang Kang
Ruqi Huang
LRM
243
4
0
08 Aug 2025
VFlowOpt: A Token Pruning Framework for LMMs with Visual Information Flow-Guided Optimization
Sihan Yang
Runsen Xu
Chenhang Cui
Tai Wang
Dahua Lin
Jiangmiao Pang
128
2
0
07 Aug 2025
WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent
Xin Guan
Peng Xia
Zhen Zhang
Xinyu Wang
Qiuchen Wang
...
Kuan Li
Yong Jiang
Pengjun Xie
Fei Huang
Jingren Zhou
315
31
0
07 Aug 2025
mKG-RAG: Multimodal Knowledge Graph-Enhanced RAG for Visual Question Answering
Xu Yuan
Liangbo Ning
Wenqi Fan
Qing Li
171
2
0
07 Aug 2025
A Metric for MLLM Alignment in Large-scale Recommendation
Yubin Zhang
Yanhua Huang
Haiming Xu
Mingliang Qi
Chang Wang
Jiarui Jin
Xiangyuan Ren
Xiaodan Wang
Ruiwen Xu
OffRL
93
0
0
07 Aug 2025
Boosting Visual Knowledge-Intensive Training for LVLMs Through Causality-Driven Visual Object Completion
International Joint Conference on Artificial Intelligence (IJCAI), 2025
Qingguo Hu
Ante Wang
Jia Song
Delai Qiu
Qingsong Liu
Jinsong Su
VLM, LRM
122
1
0
06 Aug 2025
FrEVL: Leveraging Frozen Pretrained Embeddings for Efficient Vision-Language Understanding
Emmanuelle Bourigault
Pauline Bourigault
VLM
100
1
0
06 Aug 2025
Evaluating Variance in Visual Question Answering Benchmarks
Nikitha SR
LRM
148
0
0
04 Aug 2025
MAP: Mitigating Hallucinations in Large Vision-Language Models with Map-Level Attention Processing
Chenxi Li
Yichen Guo
Benfang Qian
Jinhao You
Kai Tang
Yaosong Du
Zonghao Zhang
Xiande Huang
MLLM
211
1
0
03 Aug 2025
A Glimpse to Compress: Dynamic Visual Token Pruning for Large Vision-Language Models
Quan-Sheng Zeng
Yunheng Li
Qilong Wang
Peng-Tao Jiang
Zuxuan Wu
Ming-Ming Cheng
Qibin Hou
VLM
155
2
0
03 Aug 2025
HiPrune: Training-Free Visual Token Pruning via Hierarchical Attention in Vision-Language Models
Jizhihui Liu
Feiyi Du
Guangdao Zhu
Niu Lian
Jun Li
Bin Chen
VLM
110
2
0
01 Aug 2025
Session-Based Recommendation with Validated and Enriched LLM Intents
G. G. Lee
Y. Liu
Yifan Liu
Susik Yoon
Dong Wang
SeongKu Kang
167
2
0
01 Aug 2025
CoRGI: Verified Chain-of-Thought Reasoning with Post-hoc Visual Grounding
Shixin Yi
Lin Shang
ReLM, LRM
105
0
0
01 Aug 2025
Modality-Aware Feature Matching: A Comprehensive Review of Single- and Cross-Modality Techniques
Weide Liu
Wei Zhou
Jun Liu
Ping Hu
Jun Cheng
Jungong Han
Weisi Lin
3DV
207
3
0
30 Jul 2025
Goal-Based Vision-Language Driving
Santosh Patapati
Trisanth Srinivasan
161
0
0
30 Jul 2025
Enhancing Large Multimodal Models with Adaptive Sparsity and KV Cache Compression
Te Zhang
Yuheng Li
Junxiang Wang
Lujun Li
120
0
0
28 Jul 2025
Analyzing the Sensitivity of Vision Language Models in Visual Question Answering
Monika Shah
Sudarshan Balaji
Somdeb Sarkhel
Sanorita Dey
Deepak Venugopal
111
1
0
28 Jul 2025
TransPrune: Token Transition Pruning for Efficient Large Vision-Language Model
Ao Li
Yuxiang Duan
Jinghui Zhang
Congbo Ma
Yutong Xie
G. Carneiro
Mohammad Yaqub
Hu Wang
131
0
0
28 Jul 2025
When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios
Kele Shao
Keda Tao
Kejia Zhang
Sicheng Feng
Mu Cai
Yuzhang Shang
Haoxuan You
Can Qin
Yang Sui
Huan Wang
489
10
0
27 Jul 2025
Causality-aligned Prompt Learning via Diffusion-based Counterfactual Generation
Xinshu Li
Ruoyu Wang
Erdun Gao
Mingming Gong
Lina Yao
DiffM
171
0
0
26 Jul 2025
VisionTrap: Unanswerable Questions On Visual Data
Asir Saadat
Syem Aziz
Shahriar Mahmud
Abdullah Ibne Masud Mahi
Sabbir Ahmed
118
0
0
23 Jul 2025
FedVLM: Scalable Personalized Vision-Language Models through Federated Learning
Arkajyoti Mitra
Afia Anjum
Paul Agbaje
Mert D. Pesé
Habeeb Olufowobi
VLM
186
2
0
23 Jul 2025
ReasonVQA: A Multi-hop Reasoning Benchmark with Structural Knowledge for Visual Question Answering
Duong T. Tran
T. Tran
M. Hauswirth
Danh Le-Phuoc
189
2
0
22 Jul 2025