All You May Need for VQA are Image Captions

North American Chapter of the Association for Computational Linguistics (NAACL), 2022
4 May 2022
Soravit Changpinyo
Doron Kukliansky
Idan Szpektor
Xi Chen
Nan Ding
Radu Soricut

Papers citing "All You May Need for VQA are Image Captions"

50 / 56 papers shown
Foundational Question Generation for Video Question Answering via an Embedding-Integrated Approach
Ju-Young Oh
55
0
0
18 Nov 2025
SCRA-VQA: Summarized Caption-Rerank for Augmented Large Language Models in Visual Question Answering
Yan Zhang
Jiaqing Lin
Miao Zhang
Kui Xiao
Xiaoju Hou
Yue Zhao
Ruoyao Xiao
82
0
0
25 Sep 2025
When Big Models Train Small Ones: Label-Free Model Parity Alignment for Efficient Visual Question Answering using Small VLMs
A. S. Penamakuri
Navlika Singh
Piyush Arora
Anand Mishra
VLM
99
1
0
20 Sep 2025
Adapting Vision-Language Models for Evaluating World Models
Mariya Hendriksen
Tabish Rashid
David Bignell
Raluca Georgescu
Abdelhak Lemkhenter
Katja Hofmann
Sam Devlin
Sarah Parisot
117
0
0
22 Jun 2025
Capturing Visualization Design Rationale
Maeve Hutchinson
Radu Jianu
A. Slingsby
Jo Wood
Pranava Madhyastha
96
0
0
19 Jun 2025
SATORI-R1: Incentivizing Multimodal Reasoning through Explicit Visual Anchoring
Chuming Shen
Wei Wei
Xiaoye Qu
Yu Cheng
LRM
330
8
0
25 May 2025
Multi-Modal Language Models as Text-to-Image Model Evaluators
Jiahui Chen
Candace Ross
Reyhane Askari Hemmat
Koustuv Sinha
Melissa Hall
M. Drozdzal
Adriana Romero-Soriano
EGVM
313
1
0
01 May 2025
SHAPE : Self-Improved Visual Preference Alignment by Iteratively Generating Holistic Winner
Kejia Chen
Jiawen Zhang
Jiacong Hu
Jiazhen Yang
Jian Lou
Zunlei Feng
Weilong Dai
288
1
0
06 Mar 2025
DrawEduMath: Evaluating Vision Language Models with Expert-Annotated Students' Hand-Drawn Math Images
North American Chapter of the Association for Computational Linguistics (NAACL), 2025
Sami Baral
L. Lucy
Ryan Knight
Alice Ng
Luca Soldaini
Neil T. Heffernan
Kyle Lo
260
10
0
28 Jan 2025
MedCoT: Medical Chain of Thought via Hierarchical Expert
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Jiaxiang Liu
Yuan Wang
Jiawei Du
Qiufeng Wang
Zuozhu Liu
LRM
375
43
0
18 Dec 2024
An Entailment Tree Generation Approach for Multimodal Multi-Hop Question Answering with Mixture-of-Experts and Iterative Feedback Mechanism
ACM Multimedia (MM), 2024
Qing Zhang
Haocheng Lv
Jie Liu
Zheyu Chen
Jianyong Duan
Hao Wang
Li He
Mingying Xv
219
3
0
08 Dec 2024
Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach
Neural Information Processing Systems (NeurIPS), 2024
Mathilde Caron
Alireza Fathi
Cordelia Schmid
Ahmet Iscen
205
3
0
31 Oct 2024
R-LLaVA: Improving Med-VQA Understanding through Visual Region of Interest
Xupeng Chen
Zhixin Lai
Kangrui Ruan
Shichu Chen
Jiaxiang Liu
Zuozhu Liu
553
14
0
27 Oct 2024
From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding
Heqing Zou
Tianze Luo
Guiyang Xie
Victor Zhang
...
Guangcong Wang
Juanyang Chen
Zhuochen Wang
Hansheng Zhang
Huaijian Zhang
VLM
261
17
0
27 Sep 2024
Enhancing Visual Question Answering through Ranking-Based Hybrid Training and Multimodal Fusion
Peiyuan Chen
Zecheng Zhang
Yiping Dong
Li Zhou
Han Wang
206
16
0
14 Aug 2024
Evaluating Linguistic Capabilities of Multimodal LLMs in the Lens of Few-Shot Learning
Mustafa Dogan
İlker Kesen
Iacer Calixto
Aykut Erdem
Erkut Erdem
LRM
180
2
0
17 Jul 2024
Conceptual Learning via Embedding Approximations for Reinforcing Interpretability and Transparency
Maor Dikter
Tsachi Blau
Chaim Baskin
247
0
0
13 Jun 2024
Language-guided Detection and Mitigation of Unknown Dataset Bias
Zaiying Zhao
Soichiro Kumano
Toshihiko Yamasaki
189
2
0
05 Jun 2024
C3L: Content Correlated Vision-Language Instruction Tuning Data Generation via Contrastive Learning
Ji Ma
Wei Suo
Peng Wang
Yanning Zhang
VLM
217
0
0
21 May 2024
BRAVE: Broadening the visual encoding of vision-language models
European Conference on Computer Vision (ECCV), 2024
Oğuzhan Fatih Kar
A. Tonioni
Petra Poklukar
Achin Kulshrestha
Amir Zamir
Federico Tombari
MLLM VLM
264
54
0
10 Apr 2024
CIC: A Framework for Culturally-Aware Image Captioning
Youngsik Yun
Jihie Kim
VLM
328
9
0
08 Feb 2024
Towards A Better Metric for Text-to-Video Generation
Jay Zhangjie Wu
Guian Fang
Haoning Wu
Xintao Wang
Yixiao Ge
...
Rui Zhao
Weisi Lin
Wynne Hsu
Ying Shan
Mike Zheng Shou
VGen
214
44
0
15 Jan 2024
Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training
Longtian Qiu
Shan Ning
Xuming He
VLM
170
11
0
04 Jan 2024
AQUALLM: Audio Question Answering Data Generation Using Large Language Models
Swarup Ranjan Behera
Krishna Mohan Injeti
Jaya Sai Kiran Patibandla
P. Pokala
Pailla Balakrishna Reddy
AuLLM
194
6
0
28 Dec 2023
A Strong Baseline for Temporal Video-Text Alignment
Zeqian Li
Qirui Chen
Tengda Han
Ya Zhang
Yanfeng Wang
Weidi Xie
AI4TS VGen
188
10
0
21 Dec 2023
Improving Zero-shot Visual Question Answering via Large Language Models with Reasoning Question Prompts
ACM Multimedia (ACM MM), 2023
Yunshi Lan
Xiang Li
Xin Liu
Yang Li
Wei Qin
Weining Qian
LRM ReLM
333
37
0
15 Nov 2023
From Image to Language: A Critical Analysis of Visual Question Answering (VQA) Approaches, Challenges, and Opportunities
Information Fusion (Inf. Fusion), 2023
Md Farhan Ishmam
Md Sakib Hossain Shovon
M. F. Mridha
Nilanjan Dey
360
68
0
01 Nov 2023
Davidsonian Scene Graph: Improving Reliability in Fine-grained Evaluation for Text-to-Image Generation
International Conference on Learning Representations (ICLR), 2023
Jaemin Cho
Yushi Hu
Roopal Garg
Peter Anderson
Ranjay Krishna
Jason Baldridge
Mohit Bansal
Jordi Pont-Tuset
Su Wang
EGVM
273
118
0
27 Oct 2023
Exploring Question Decomposition for Zero-Shot VQA
Neural Information Processing Systems (NeurIPS), 2023
Zaid Khan
B. Vijaykumar
S. Schulter
Manmohan Chandraker
Yun Fu
ReLM
174
18
0
25 Oct 2023
Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models
International Conference on Learning Representations (ICLR), 2023
Archiki Prasad
Elias Stengel-Eskin
Mohit Bansal
ReLM LRM
200
13
0
09 Oct 2023
Negative Object Presence Evaluation (NOPE) to Measure Object Hallucination in Vision-Language Models
Holy Lovenia
Wenliang Dai
Samuel Cahyawijaya
Ziwei Ji
Pascale Fung
MLLM
207
69
0
09 Oct 2023
Tackling VQA with Pretrained Foundation Models without Further Training
Alvin De Jun Tan
Bingquan Shen
MLLM
176
2
0
27 Sep 2023
CARTIER: Cartographic lAnguage Reasoning Targeted at Instruction Execution for Robots
IEEE International Conference on Robotics and Automation (ICRA), 2023
D. Rivkin
Nikhil Kakodkar
F. Hogan
Bobak H. Baghi
Gregory Dudek
LM&Ro
214
4
0
21 Jul 2023
Unified Language Representation for Question Answering over Text, Tables, and Images
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Yu Bowen
Cheng Fu
Haiyang Yu
Fei Huang
Yongbin Li
LMTD
225
29
0
29 Jun 2023
Pre-Training Multi-Modal Dense Retrievers for Outside-Knowledge Visual Question Answering
International Conference on the Theory of Information Retrieval (ICTIR), 2023
Alireza Salemi
Mahta Rafiee
Hamed Zamani
137
13
0
28 Jun 2023
Encyclopedic VQA: Visual questions about detailed properties of fine-grained categories
IEEE International Conference on Computer Vision (ICCV), 2023
Thomas Mensink
J. Uijlings
Lluis Castrejon
A. Goel
Felipe Cadar
Howard Zhou
Fei Sha
A. Araújo
V. Ferrari
242
75
0
15 Jun 2023
Q: How to Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images!
Computer Vision and Pattern Recognition (CVPR), 2023
Zaid Khan
B. Vijaykumar
S. Schulter
Xiang Yu
Y. Fu
Manmohan Chandraker
VLM MLLM
188
23
0
06 Jun 2023
PaLI-X: On Scaling up a Multilingual Vision and Language Model
Xi Chen
Josip Djolonga
Piotr Padlewski
Basil Mustafa
Soravit Changpinyo
...
Mojtaba Seyedhosseini
A. Angelova
Xiaohua Zhai
N. Houlsby
Radu Soricut
VLM
310
246
0
29 May 2023
If at First You Don't Succeed, Try, Try Again: Faithful Diffusion-based Text-to-Image Generation by Selection
Shyamgopal Karthik
Karsten Roth
Goran Frehse
Zeynep Akata
188
31
0
22 May 2023
Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner
ACM Multimedia (ACM MM), 2023
Zikang Liu
Sihan Chen
Longteng Guo
Handong Li
Xingjian He
Qingbin Liu
168
3
0
19 May 2023
What You See is What You Read? Improving Text-Image Alignment Evaluation
Neural Information Processing Systems (NeurIPS), 2023
Michal Yarom
Yonatan Bitton
Soravit Changpinyo
Roee Aharoni
Jonathan Herzig
Oran Lang
E. Ofek
Idan Szpektor
EGVM
451
115
0
17 May 2023
TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering
IEEE International Conference on Computer Vision (ICCV), 2023
Yushi Hu
Benlin Liu
Jungo Kasai
Yizhong Wang
Mari Ostendorf
Ranjay Krishna
Noah A. Smith
EGVM
240
331
0
21 Mar 2023
Architext: Language-Driven Generative Architecture Design
Theodoros Galanos
Antonios Liapis
Georgios N. Yannakakis
VLM AI4CE
187
7
0
13 Mar 2023
Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images
IEEE International Conference on Computer Vision (ICCV), 2023
Nitzan Bitton-Guetta
Yonatan Bitton
Jack Hessel
Ludwig Schmidt
Yuval Elovici
Gabriel Stanovsky
Roy Schwartz
VLM
352
85
0
13 Mar 2023
PaLM-E: An Embodied Multimodal Language Model
International Conference on Machine Learning (ICML), 2023
Danny Driess
F. Xia
Mehdi S. M. Sajjadi
Corey Lynch
Aakanksha Chowdhery
...
Marc Toussaint
Klaus Greff
Andy Zeng
Igor Mordatch
Peter R. Florence
LM&Ro
357
2,153
0
06 Mar 2023
EVJVQA Challenge: Multilingual Visual Question Answering
Journal of Computer Science and Cybernetics (JCSC), 2023
Ngan Luu-Thuy Nguyen
Nghia Hieu Nguyen
Duong T.D. Vo
K. Tran
Kiet Van Nguyen
252
9
0
23 Feb 2023
Connecting Vision and Language with Video Localized Narratives
Computer Vision and Pattern Recognition (CVPR), 2023
P. Voigtlaender
Soravit Changpinyo
Jordi Pont-Tuset
Radu Soricut
V. Ferrari
VGen
266
30
0
22 Feb 2023
Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities
IEEE International Conference on Computer Vision (ICCV), 2023
Hexiang Hu
Yi Luan
Yang Chen
Urvashi Khandelwal
Mandar Joshi
Kenton Lee
Kristina Toutanova
Ming-Wei Chang
VLM
303
88
0
22 Feb 2023
MAQA: A Multimodal QA Benchmark for Negation
Judith Yue Li
Aren Jansen
Qingqing Huang
Joonseok Lee
Ravi Ganti
Dima Kuzmin
167
6
0
09 Jan 2023
From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models
Computer Vision and Pattern Recognition (CVPR), 2022
Jiaxian Guo
Junnan Li
Dongxu Li
A. M. H. Tiong
Boyang Albert Li
Dacheng Tao
Steven C. H. Hoi
VLM MLLM
285
156
0
21 Dec 2022