ResearchTrend.AI

Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering

2 December 2016
Yash Goyal, Tejas Khot, D. Summers-Stay, Dhruv Batra, Devi Parikh
CoGe

Papers citing "Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering"

Showing 50 of 2,277 citing papers.
HaploOmni: Unified Single Transformer for Multimodal Video Understanding and Generation
Yicheng Xiao, Lin Song, Rui Yang, Cheng Cheng, Zunnan Xu, Zhaoyang Zhang, Yixiao Ge, Xiu Li, Mingyu Ding
03 Jun 2025

Go Beyond Earth: Understanding Human Actions and Scenes in Microgravity Environments
Di Wen, Lei Qi, Kunyu Peng, Kailun Yang, Fei Teng, ..., Yufan Chen, R. Liu, Yitian Shi, M. Sarfraz, Rainer Stiefelhagen
03 Jun 2025

Is Extending Modality The Right Path Towards Omni-Modality?
Tinghui Zhu, Kai Zhang, Muhao Chen, Eric Fosler-Lussier
VLM
02 Jun 2025

Learning What Matters: Prioritized Concept Learning via Relative Error-driven Sample Selection
Shivam Chandhok, Qian Yang, Oscar Manas, Kanishk Jain, Leonid Sigal, Aishwarya Agrawal
01 Jun 2025

Taming LLMs by Scaling Learning Rates with Gradient Grouping
Siyuan Li, Juanxi Tian, Zedong Wang, Xin Jin, Zicheng Liu, Wentao Zhang, Dan Xu
01 Jun 2025

Fighting Fire with Fire (F3): A Training-free and Efficient Visual Adversarial Example Purification Method in LVLMs
Yudong Zhang, Ruobing Xie, Yiqing Huang, Jiansheng Chen, Xingwu Sun, Zhanhui Kang, Di Wang, Yu Wang
AAML
01 Jun 2025

NavBench: Probing Multimodal Large Language Models for Embodied Navigation
Yanyuan Qiao, Haodong Hong, Wenqi Lyu, Dong An, Siqi Zhang, Yutong Xie, Xinyu Wang, Qi Wu
LM&Ro
01 Jun 2025

Enhancing Multimodal Continual Instruction Tuning with BranchLoRA
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Duzhen Zhang, Yong Ren, Zhong-Zhi Li, Yahan Yu, Jiahua Dong, Chenxing Li, Zhilong Ji, Jinfeng Bai
CLL
31 May 2025

Open CaptchaWorld: A Comprehensive Web-based Platform for Testing and Benchmarking Multimodal LLM Agents
Yaxin Luo, Zhaoyi Li, Jiacheng Liu, Jiacheng Cui, Xiaohan Zhao, Zhiqiang Shen
LLMAG, LRM, VLM
30 May 2025

Benchmarking Foundation Models for Zero-Shot Biometric Tasks
Redwan Sony, Parisa Farmanifard, Hamzeh Alzwairy, Nitish Shukla, Arun Ross
CVBM, VLM
30 May 2025

Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces
Gen Luo, Ganlin Yang, Ziyang Gong, Guanzhou Chen, Haonan Duan, ..., Wenhai Wang, Jifeng Dai, Yu Qiao, Rongrong Ji, X. Zhu
LM&Ro
30 May 2025

Proxy-FDA: Proxy-based Feature Distribution Alignment for Fine-tuning Vision Foundation Models without Forgetting
Chen Huang, Skyler Seto, Hadi Pouransari, Mehrdad Farajtabar, Raviteja Vemulapalli, Fartash Faghri, Oncel Tuzel, B. Theobald, Josh Susskind
CLL
30 May 2025

FinMME: Benchmark Dataset for Financial Multi-Modal Reasoning Evaluation
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Junyu Luo, Zhizhuo Kou, Liming Yang, Xiao Luo, Jinsheng Huang, ..., Jiaming Ji, Xuanzhe Liu, Sirui Han, Ming Zhang, Wenhan Luo
30 May 2025

Knowledge Insulating Vision-Language-Action Models: Train Fast, Run Fast, Generalize Better
Danny Driess, Jost Tobias Springenberg, Brian Ichter, Lili Yu, Adrian Li-Bell, ..., Allen Z. Ren, Homer Walke, Quan Vuong, Lucy Xiaoyang Shi, Sergey Levine
29 May 2025

ContextQFormer: A New Context Modeling Method for Multi-Turn Multi-Modal Conversations
Yiming Lei, Zhizheng Yang, Zeming Liu, Haitao Leng, Shaoguo Liu, Tingting Gao, Qingjie Liu, Yunhong Wang
29 May 2025

Multi-Sourced Compositional Generalization in Visual Question Answering
International Joint Conference on Artificial Intelligence (IJCAI), 2025
Chuanhao Li, Wenbo Ye, Zhen Li, Yuwei Wu, Yunde Jia
CoGe
29 May 2025

Synthetic Document Question Answering in Hungarian
Jonathan Li, Zoltan Csaki, Nidhi Hiremath, Etash Guha, Fenglu Hong, Edward Ma, Urmish Thakker
29 May 2025

NegVQA: Can Vision Language Models Understand Negation?
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Yuhui Zhang, Yuchang Su, Yiming Liu, Serena Yeung-Levy
MLLM, CoGe
28 May 2025

VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models
Ce Zhang, Kaixin Ma, Tianqing Fang, Wenhao Yu, Hongming Zhang, Zhisong Zhang, Yaqi Xie, Katia Sycara, Haitao Mi, Dong Yu
VLM
28 May 2025

EvoMoE: Expert Evolution in Mixture of Experts for Multimodal Large Language Models
Linglin Jing, Yuting Gao, Zhigang Wang, Wang Lan, Yiwen Tang, Wenhai Wang, Kaipeng Zhang, Qingpei Guo
MoE
28 May 2025

Understand, Think, and Answer: Advancing Visual Reasoning with Large Multimodal Models
Yufei Zhan, Hongyin Zhao, Yousong Zhu, Shurong Zheng, Fan Yang, Ming Tang, Jinqiao Wang
VLM, LRM
27 May 2025

FRAMES-VQA: Benchmarking Fine-Tuning Robustness across Multi-Modal Shifts in Visual Question Answering
Computer Vision and Pattern Recognition (CVPR), 2025
Chengyue Huang, Brisa Maneechotesuwan, Shivang Chopra, Z. Kira
AAML
27 May 2025

ConText-CIR: Learning from Concepts in Text for Composed Image Retrieval
Computer Vision and Pattern Recognition (CVPR), 2025
Eric Xing, Pranavi Kolouju, Robert Pless, Abby Stylianou, Nathan Jacobs
27 May 2025

Multimodal Federated Learning: A Survey through the Lens of Different FL Paradigms
Yuanzhe Peng, Jieming Bian, Lei Wang, Yin Huang, Jie Xu
27 May 2025

RefAV: Towards Planning-Centric Scenario Mining
Cainan Davidson, Deva Ramanan, Neehar Peri
27 May 2025

ID-Align: RoPE-Conscious Position Remapping for Dynamic High-Resolution Adaptation in Vision-Language Models
Bozhou Li, Wentao Zhang
VLM
27 May 2025

Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning
Minheng Ni, Zhengyuan Yang, Linjie Li, Chung-Ching Lin, Kevin Qinghong Lin, W. Zuo, Lijuan Wang
ReLM, LRM
26 May 2025

Decomposing Complex Visual Comprehension into Atomic Visual Skills for Vision Language Models
Hyunsik Chae, Seungwoo Yoon, J. Park, Chloe Yewon Chun, Yongin Cho, Mu Cai, Yong Jae Lee, Ernest K. Ryu
CoGe, VLM
26 May 2025

My Answer Is NOT 'Fair': Mitigating Social Bias in Vision-Language Models via Fair and Biased Residuals
Jian Lan, Yifei Fu, Udo Schlegel, Gengyuan Zhang, Tanveer Hannan, Haokun Chen, Thomas Seidl
26 May 2025

Causal-LLaVA: Causal Disentanglement for Mitigating Hallucination in Multimodal Large Language Models
Xinmiao Hu, C. Wang, Ruihe An, ChenYu Shao, Xiaojun Ye, Sheng Zhou, Liangcheng Li
MLLM, LRM
26 May 2025

MM-Prompt: Cross-Modal Prompt Tuning for Continual Visual Question Answering
Xu Li, Fan Lyu
LRM
26 May 2025

GC-KBVQA: A New Four-Stage Framework for Enhancing Knowledge Based Visual Question Answering Performance
Mohammad Mahdi Moradi, Sudhir Mudur
25 May 2025

ToDRE: Effective Visual Token Pruning via Token Diversity and Task Relevance
Duo Li, Zuhao Yang, Xiaoqin Zhang, Ling Shao, Shijian Lu
VLM
24 May 2025

Caption This, Reason That: VLMs Caught in the Middle
Zihan Weng, Lucas Gomez, Taylor Whittington Webb, P. Bashivan
VLM, LRM
24 May 2025

DetailFusion: A Dual-branch Framework with Detail Enhancement for Composed Image Retrieval
Yuxin Yang, Yinan Zhou, Yuxin Chen, Ziqi Zhang, Zongyang Ma, ..., Bing Li, Lin Song, Jun Gao, Peng Li, Weiming Hu
23 May 2025

Instructify: Demystifying Metadata to Visual Instruction Tuning Data Conversion
Jacob A. Hansen, Wei Lin, Junmo Kang, M. Jehanzeb Mirza, Hongyin Luo, Rogerio Feris, Alan Ritter, James R. Glass, Leonid Karlinsky
VLM
23 May 2025

Slot-MLLM: Object-Centric Visual Tokenization for Multimodal LLM
Donghwan Chi, Hyomin Kim, Yoonjin Oh, Yongjin Kim, Donghoon Lee, DaeJin Jo, Jongmin Kim, Junyeob Baek, Sungjin Ahn, Sungwoong Kim
MLLM, VLM
23 May 2025

TACO: Enhancing Multimodal In-context Learning via Task Mapping-Guided Sequence Configuration
Yanshu Li, Tian Yun, Pinyuan Feng, Jinfa Huang, Ruixiang Tang
21 May 2025

TimeCausality: Evaluating the Causal Ability in Time Dimension for Vision Language Models
Zeqing Wang, Shiyuan Zhang, Chengpei Tang, Keze Wang
LRM
21 May 2025

Multi-Modality Expansion and Retention for LLMs through Parameter Merging and Decoupling
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Junlin Li, Guodong DU, Jing Li, Sim Kuan Goh, Wenya Wang, ..., Fangming Liu, Jing Li, Saleh Alharbi, Daojing He, Min Zhang
MoMe, CLL
21 May 2025

Visual Question Answering on Multiple Remote Sensing Image Modalities
Hichem Boussaid, Lucrezia Tosato, F. Weissgerber, Camille Kurtz, Laurent Wendling, Sylvain Lobry
21 May 2025

How Do Large Vision-Language Models See Text in Image? Unveiling the Distinctive Role of OCR Heads
Ingeol Baek, Hwan Chang, Sunghyun Ryu, Hwanhee Lee
21 May 2025

Make LVLMs Focus: Context-Aware Attention Modulation for Better Multimodal In-Context Learning
Yanshu Li, JianJiang Yang, Ziteng Yang, Bozheng Li, Yi Cao, ..., Ligong Han, Yingjie Victor Chen, Songlin Fei, Dongfang Liu, Ruixiang Tang
21 May 2025

VoQA: Visual-only Question Answering
Jianing An, Luyang Jiang, Jie Luo, Wenjun Wu, Lei Huang
LRM
20 May 2025

ModRWKV: Transformer Multimodality in Linear Time
Jiale Kang, Ziyin Yue, Qingyu Yin, Jiang Rui, W. Li, Zening Lu, Zhouran Ji
OffRL
20 May 2025

AdaToken-3D: Dynamic Spatial Gating for Efficient 3D Large Multimodal-Models Reasoning
Kai Zhang, Xingyu Chen, Xiaofeng Zhang
19 May 2025

STAR: Stage-Wise Attention-Guided Token Reduction for Efficient Large Vision-Language Models Inference
Yichen Guo, Hanze Li, Zonghao Zhang, Jinhao You, Kai Tang, Xiande Huang
VLM
18 May 2025

SSR: Enhancing Depth Perception in Vision-Language Models via Rationale-Guided Spatial Reasoning
Yang Liu, Ming Ma, Xiaomin Yu, Pengxiang Ding, Han Zhao, Mingyang Sun, Siteng Huang, Xuetao Zhang
LRM
18 May 2025

Top-Down Compression: Revisit Efficient Vision Token Projection for Visual Instruction Tuning
Bonan Li, Zicheng Zhang, Songhua Liu, Weihao Yu, Xinchao Wang
VLM
17 May 2025

Are Multimodal Large Language Models Ready for Omnidirectional Spatial Reasoning?
Zihao Dongfang, Xu Zheng, Ziqiao Weng, Yuanhuiyi Lyu, Danda Pani Paudel, Luc Van Gool, Kailun Yang, Xuming Hu
LRM
17 May 2025