Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering

2 December 2016
Yash Goyal, Tejas Khot, D. Summers-Stay, Dhruv Batra, Devi Parikh
CoGe

Papers citing "Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering"

Showing 50 of 2,276 papers

HW-MLVQA: Elucidating Multilingual Handwritten Document Understanding with a Comprehensive VQA Benchmark
Aniket Pal, Ajoy Mondal, Minesh Mathew, C. V. Jawahar
VLM
21 Jul 2025

Multimodal AI for Gastrointestinal Diagnostics: Tackling VQA in MEDVQA-GI 2025
Sujata Gaihre, Amir Thapa Magar, Prasuna Pokharel, Laxmi Tiwari
LM&MA
19 Jul 2025

Teaching Vision-Language Models to Ask: Resolving Ambiguity in Visual Questions
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Pu Jian, Donglei Yu, Wen Yang, Shuo Ren, Jiajun Zhang
18 Jul 2025

Mitigating Object Hallucinations via Sentence-Level Early Intervention
Shangpin Peng, Senqiao Yang, Li Jiang, Zhuotao Tian
MLLM
16 Jul 2025

Describe Anything Model for Visual Question Answering on Text-rich Images
Yen-Linh Vu, Dinh-Thang Duong, Truong-Binh Duong, Anh-Khoi Nguyen, Thanh-Huy Nguyen, ..., Jianhua Xing, Xingjian Li, Tianyang Wang, Ulas Bagci, Min Xu
VLM
16 Jul 2025

Cross-modal Associations in Vision and Language Models: Revisiting the Bouba-Kiki Effect
Tom Kouwenhoven, Kiana Shahrasbi, Tessa Verhoef
VLM
14 Jul 2025

MultiVox: A Benchmark for Evaluating Voice Assistants for Multimodal Interactions
Ramaneswaran Selvakumar, Ashish Seth, Nishit Anand, Utkarsh Tyagi, Sonal Kumar, Sreyan Ghosh, Dinesh Manocha
AuLLM
14 Jul 2025

MCA-LLaVA: Manhattan Causal Attention for Reducing Hallucination in Large Vision-Language Models
Qiyan Zhao, Xiaofeng Zhang, Yiheng Li, Yun Xing, Xiaosong Yuan, Feilong Tang, Sinan Fan, Xuhang Chen, Xuyao Zhang, Dahan Wang
12 Jul 2025

Advancing Multimodal LLMs by Large-Scale 3D Visual Instruction Dataset Generation
Liu He, Xiao Zeng, Yizhi Song, Albert Y. C. Chen, Lu Xia, Shashwat Verma, Sankalp Dayal, Min Sun, Cheng-Hao Kuo, Daniel G. Aliaga
VGen
11 Jul 2025

Beyond the Linear Separability Ceiling: Aligning Representations in VLMs
Enrico Vompa, Tanel Tammet, Mohit Vaishnav
VLM, LRM
10 Jul 2025

SlimCaching: Edge Caching of Mixture-of-Experts for Distributed Inference
Qian Chen, Xianhao Chen, Kaibin Huang
MoE
09 Jul 2025

LIRA: Inferring Segmentation in Large Multi-modal Models with Local Interleaved Region Assistance
Zhang Li, Biao Yang, Qiang Liu, Shuo Zhang, Zhiyin Ma, Liang Yin, Linger Deng, Yabo Sun, Yuliang Liu, Xiang Bai
08 Jul 2025

HumanVideo-MME: Benchmarking MLLMs for Human-Centric Video Understanding
Rui Yu, J. Zhang, Zhenye Gan, Qingdong He, Xiaobin Hu, ..., Chengjie Wang, Zhucun Xue, Chaoyou Fu, Xinwei He, Xiang Bai
VLM
07 Jul 2025

AuroraLong: Bringing RNNs Back to Efficient Open-Ended Video Understanding
Weili Xu, Enxin Song, Wenhao Chai, Xuexiang Wen, Tian-Chun Ye, Gaoang Wang
03 Jul 2025

Rethinking Visual Token Reduction in LVLMs under Cross-modal Misalignment
Rui Xu, Yunke Wang, Yong Luo, Bo Du
VLM
27 Jun 2025

FOCUS: Internal MLLM Representations for Efficient Fine-Grained Visual Question Answering
Liangyu Zhong, Fabio Rosenthal, Joachim Sicking, Fabian Hüger, Thorsten Bagdonat, Hanno Gottschalk, Leo Schwinn
LRM
26 Jun 2025

CronusVLA: Towards Efficient and Robust Manipulation via Multi-Frame Vision-Language-Action Modeling
Hao Li, Shuai Yang, Yilun Chen, Xinyi Chen, Xiaoda Yang, ..., Hanqing Wang, Tai Wang, Dahua Lin, Feng Zhao, Jiangmiao Pang
24 Jun 2025

UniFork: Exploring Modality Alignment for Unified Multimodal Understanding and Generation
Teng Li, Quanfeng Lu, Lirui Zhao, Hao Li, X. Zhu, Yu Qiao, Jun Zhang, Wenqi Shao
20 Jun 2025

LaVi: Efficient Large Vision-Language Models via Internal Feature Modulation
Tongtian Yue, Longteng Guo, Yepeng Tang, Zijia Zhao, Xinxin Zhu, Hua Huang, Jing Liu
MLLM, VLM
20 Jun 2025

SIRI-Bench: Challenging VLMs' Spatial Intelligence through Complex Reasoning Tasks
Zijian Song, Xiaoxin Lin, Qiuming Huang, Guangrun Wang, Liang Lin
LRM
17 Jun 2025

FinLMM-R1: Enhancing Financial Reasoning in LMM through Scalable Data and Reward Design
Kai Lan, Jiayong Zhu, Jiangtong Li, Dawei Cheng, Guang-Sheng Chen, Changjun Jiang
LRM
16 Jun 2025

Stream-Omni: Simultaneous Multimodal Interactions with Large Language-Vision-Speech Model
Shaolei Zhang, Shoutao Guo, Qingkai Fang, Yan Zhou, Yang Feng
MLLM, AuLLM, VLM
16 Jun 2025

Dynamic Context-oriented Decomposition for Task-aware Low-rank Adaptation with Less Forgetting and Faster Convergence
Jianlong Wu, Sihao Liu, Chuan Rao, Bang An, Tiancheng Shen, Juil Sock, Ming-Hsuan Yang, Bernard Ghanem
16 Jun 2025

PRISM2: Unlocking Multi-Modal General Pathology AI with Clinical Dialogue
Eugene Vorontsov, Adam Casson, Julian Viret, Eric Zimmermann, ..., Razik Yousfi, Nicolò Fusi, Thomas J. Fuchs, Kristen Severson, Siqi Liu
MedIm, LM&MA
16 Jun 2025

LARGO: Low-Rank Regulated Gradient Projection for Robust Parameter Efficient Fine-Tuning
Haotian Zhang, Liu Liu, Baosheng Yu, Jiayan Qiu, Yanwei Ren, Xianglong Liu
14 Jun 2025

Manager: Aggregating Insights from Unimodal Experts in Two-Tower VLMs and MLLMs
Xiao Xu, L. Qin, Wanxiang Che, Min-Yen Kan
MoE, VLM
13 Jun 2025

Pisces: An Auto-regressive Foundation Model for Image Understanding and Generation
Zhiyang Xu, Jiuhai Chen, Zhaojiang Lin, Xichen Pan, Lifu Huang, ..., Di Jin, Michihiro Yasunaga, Lili Yu, Xi Lin, Shaoliang Nie
12 Jun 2025

Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs
Qizhe Zhang, Mengzhen Liu, Lichen Li, Ming Lu, Yuan Zhang, Junwen Pan, Qi She, Shanghang Zhang
VLM
12 Jun 2025

TOGA: Temporally Grounded Open-Ended Video QA with Weak Supervision
Ayush Gupta, A. Roy, Rama Chellappa, Nathaniel D. Bastian, Alvaro Velasquez, Susmit Jha
11 Jun 2025

Outside Knowledge Conversational Video (OKCV) Dataset -- Dialoguing over Videos
Benjamin Z. Reichman, Constantin Patsch, Jack Truxal, Atishay Jain, Larry Heck
11 Jun 2025

A Shortcut-aware Video-QA Benchmark for Physical Understanding via Minimal Video Pairs
Benno Krojer, Mojtaba Komeili, Candace Ross, Q. Garrido, Koustuv Sinha, Nicolas Ballas, Mahmoud Assran
11 Jun 2025

Same Task, Different Circuits: Disentangling Modality-Specific Mechanisms in VLMs
Yaniv Nikankin, Dana Arad, Yossi Gandelsman, Yonatan Belinkov
10 Jun 2025

Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better
Dianyi Wang, Wei Song, Yikun Wang, Siyuan Wang, Kaicheng Yu, Zhongyu Wei, Jiaqi Wang
10 Jun 2025

AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models
Zheda Mai, A. Chowdhury, Zihe Wang, Sooyoung Jeon, Jingyan Bai, Jiacheng Hou, Jihyung Kil, Wei-Lun Chao
CoGe
10 Jun 2025

An Open-Source Software Toolkit & Benchmark Suite for the Evaluation and Adaptation of Multimodal Action Models
Pranav Guruprasad, Yangyue Wang, Sudipta Chowdhury, Jaewoo Song, Harshvardhan Sikka
10 Jun 2025

Evaluating Visual Mathematics in Multimodal LLMs: A Multilingual Benchmark Based on the Kangaroo Tests
Arnau Igualde Sáez, Lamyae Rhomrasi, Yusef Ahsini, Ricardo Vinuesa, S. Hoyas, Jose P. García Sabater, Marius J. Fullana i Alfonso, J. Alberto Conejero
LRM
09 Jun 2025

Synthetic Visual Genome
Computer Vision and Pattern Recognition (CVPR), 2025
J. S. Park, Zixian Ma, Linjie Li, Chenhao Zheng, Cheng-Yu Hsieh, ..., Quan Kong, Norimasa Kobori, Ali Farhadi, Yejin Choi, Ranjay Krishna
09 Jun 2025

A Good CREPE needs more than just Sugar: Investigating Biases in Compositional Vision-Language Benchmarks
Vishaal Udandarao, Mehdi Cherti, Shyamgopal Karthik, J. Jitsev, Samuel Albanie, Matthias Bethge
CoGe
09 Jun 2025

A Neurosymbolic Agent System for Compositional Visual Reasoning
Yichang Xu, Gaowen Liu, Ramana Rao Kompella, Sihao Hu, Tiansheng Huang, Fatih Ilhan, Selim Furkan Tekin, Zachary Yahn
LRM, VLM
09 Jun 2025

Hallucination at a Glance: Controlled Visual Edits and Fine-Grained Multimodal Learning
Tianyi Bai, Yuxuan Fan, Jiantao Qiu, Fupeng Sun, Jiayi Song, Junlin Han, Zichen Liu, Conghui He, Wentao Zhang, Binhang Yuan
MLLM, VLM
08 Jun 2025

MAGNET: A Multi-agent Framework for Finding Audio-Visual Needles by Reasoning over Multi-Video Haystacks
Sanjoy Chowdhury, Mohamed Elmoghany, Yohan Abeysinghe, Mahmoud Ahmed, Sayan Nag, Salman Khan, Mohamed Elhoseiny, Dinesh Manocha
08 Jun 2025

FREE: Fast and Robust Vision Language Models with Early Exits
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Divya J. Bajpai, M. Hanawal
VLM
07 Jun 2025

Visual Graph Arena: Evaluating Visual Conceptualization of Vision and Multimodal Large Language Models
Z. Babaiee, Peyman M. Kiasari, Daniela Rus, Radu Grosu
06 Jun 2025

Token Transforming: A Unified and Training-Free Token Compression Framework for Vision Transformer Acceleration
Fanhu Zeng, Deli Yu, Zhenglun Kong, Hao Tang
ViT
06 Jun 2025

CoMemo: LVLMs Need Image Context with Image Memory
Shi-Qi Liu, Weijie Su, Xizhou Zhu, Wenhai Wang, Jifeng Dai
VLM
06 Jun 2025

TextVidBench: A Benchmark for Long Video Scene Text Understanding
Yangyang Zhong, Ji Qi, Yuan Yao, Pengxin Luo, Yunfeng Yan, Donglian Qi, Zhiyuan Liu, Tat-Seng Chua
05 Jun 2025

SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs
Jiahui Wang, Z. Liu, Yongming Rao, Jiwen Lu
VLM, LRM
05 Jun 2025

CIVET: Systematic Evaluation of Understanding in VLMs
Massimo Rizzoli, Simone Alghisi, Olha Khomyn, Gabriel Roccabruna, Seyed Mahed Mousavi, Giuseppe Riccardi
05 Jun 2025

Coordinated Robustness Evaluation Framework for Vision-Language Models
Ashwin Ramesh Babu, Sajad Mousavi, Vineet Gundecha, Sahand Ghorbanpour, Avisek Naug, Antonio Guillen, Ricardo Luna Gutierrez, Soumyendu Sarkar
AAML
05 Jun 2025

Mitigating Hallucinations in Large Vision-Language Models via Entity-Centric Multimodal Preference Optimization
Jiulong Wu, Zhengliang Shi, Shuaiqiang Wang, J. Huang, Dawei Yin, Lingyong Yan, Min Cao, Min Zhang
MLLM
04 Jun 2025