2205.01883
Cited By
All You May Need for VQA are Image Captions
4 May 2022
Soravit Changpinyo
Doron Kukliansky
Idan Szpektor
Xi Chen
Nan Ding
Radu Soricut
Papers citing "All You May Need for VQA are Image Captions"
50 / 57 papers shown
Title
Multi-Modal Language Models as Text-to-Image Model Evaluators
Jiahui Chen
Candace Ross
Reyhane Askari Hemmat
Koustuv Sinha
Melissa Hall
M. Drozdzal
Adriana Romero-Soriano
EGVM
60
0
0
01 May 2025
SHAPE : Self-Improved Visual Preference Alignment by Iteratively Generating Holistic Winner
Kejia Chen
Jiawen Zhang
Jiacong Hu
Jiazhen Yang
Jian Lou
Zunlei Feng
Mingli Song
57
0
0
06 Mar 2025
DrawEduMath: Evaluating Vision Language Models with Expert-Annotated Students' Hand-Drawn Math Images
Sami Baral
L. Lucy
Ryan Knight
Alice Ng
Luca Soldaini
Neil T. Heffernan
Kyle Lo
41
3
0
28 Jan 2025
MedCoT: Medical Chain of Thought via Hierarchical Expert
Jiaxiang Liu
Yuan Wang
Jiawei Du
Joey Tianyi Zhou
Zuozhu Liu
LRM
70
9
0
18 Dec 2024
An Entailment Tree Generation Approach for Multimodal Multi-Hop Question Answering with Mixture-of-Experts and Iterative Feedback Mechanism
Qing Zhang
Haocheng Lv
Jie Liu
Z. Chen
Jianyong Duan
Hao Wang
Li He
Mingying Xv
62
1
0
08 Dec 2024
Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach
Mathilde Caron
Alireza Fathi
Cordelia Schmid
Ahmet Iscen
26
0
0
31 Oct 2024
R-LLaVA: Improving Med-VQA Understanding through Visual Region of Interest
Xupeng Chen
Zhixin Lai
Kangrui Ruan
Shichu Chen
Jiaxiang Liu
Zuozhu Liu
33
1
0
27 Oct 2024
From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding
Heqing Zou
Tianze Luo
Guiyang Xie
Victor Zhang
...
Guangcong Wang
Juanyang Chen
Zhuochen Wang
Hansheng Zhang
Huaijian Zhang
VLM
34
6
0
27 Sep 2024
Enhancing Visual Question Answering through Ranking-Based Hybrid Training and Multimodal Fusion
Peiyuan Chen
Zecheng Zhang
Yiping Dong
Li Zhou
Han Wang
27
12
0
14 Aug 2024
Evaluating Linguistic Capabilities of Multimodal LLMs in the Lens of Few-Shot Learning
Mustafa Dogan
İlker Kesen
Iacer Calixto
Aykut Erdem
Erkut Erdem
LRM
23
1
0
17 Jul 2024
Conceptual Learning via Embedding Approximations for Reinforcing Interpretability and Transparency
Maor Dikter
Tsachi Blau
Chaim Baskin
33
0
0
13 Jun 2024
Language-guided Detection and Mitigation of Unknown Dataset Bias
Zaiying Zhao
Soichiro Kumano
Toshihiko Yamasaki
34
2
0
05 Jun 2024
C3L: Content Correlated Vision-Language Instruction Tuning Data Generation via Contrastive Learning
Ji Ma
Wei Suo
Peng Wang
Yanning Zhang
VLM
36
0
0
21 May 2024
BRAVE: Broadening the visual encoding of vision-language models
Oğuzhan Fatih Kar
A. Tonioni
Petra Poklukar
Achin Kulshrestha
Amir Zamir
Federico Tombari
MLLM
VLM
42
25
0
10 Apr 2024
CIC: A Framework for Culturally-Aware Image Captioning
Youngsik Yun
Jihie Kim
VLM
9
5
0
08 Feb 2024
Towards A Better Metric for Text-to-Video Generation
Jay Zhangjie Wu
Guian Fang
Haoning Wu
Xintao Wang
Yixiao Ge
...
Rui Zhao
Weisi Lin
Wynne Hsu
Ying Shan
Mike Zheng Shou
VGen
22
34
0
15 Jan 2024
Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training
Longtian Qiu
Shan Ning
Xuming He
VLM
33
3
0
04 Jan 2024
AQUALLM: Audio Question Answering Data Generation Using Large Language Models
Swarup Ranjan Behera
Krishna Mohan Injeti
Jaya Sai Kiran Patibandla
P. Pokala
Pailla Balakrishna Reddy
AuLLM
11
4
0
28 Dec 2023
A Strong Baseline for Temporal Video-Text Alignment
Zeqian Li
Qirui Chen
Tengda Han
Ya-Qin Zhang
Yanfeng Wang
Weidi Xie
AI4TS
VGen
14
5
0
21 Dec 2023
Improving Zero-shot Visual Question Answering via Large Language Models with Reasoning Question Prompts
Yunshi Lan
Xiang Li
Xin Liu
Yang Li
Wei Qin
Weining Qian
LRM
ReLM
17
23
0
15 Nov 2023
From Image to Language: A Critical Analysis of Visual Question Answering (VQA) Approaches, Challenges, and Opportunities
Md Farhan Ishmam
Md Sakib Hossain Shovon
M. F. Mridha
Nilanjan Dey
35
35
0
01 Nov 2023
Davidsonian Scene Graph: Improving Reliability in Fine-grained Evaluation for Text-to-Image Generation
Jaemin Cho
Yushi Hu
Roopal Garg
Peter Anderson
Ranjay Krishna
Jason Baldridge
Mohit Bansal
Jordi Pont-Tuset
Su Wang
EGVM
22
65
0
27 Oct 2023
Exploring Question Decomposition for Zero-Shot VQA
Zaid Khan
B. Vijaykumar
S. Schulter
Manmohan Chandraker
Yun Fu
ReLM
17
9
0
25 Oct 2023
Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models
Archiki Prasad
Elias Stengel-Eskin
Mohit Bansal
ReLM
LRM
28
7
0
09 Oct 2023
Negative Object Presence Evaluation (NOPE) to Measure Object Hallucination in Vision-Language Models
Holy Lovenia
Wenliang Dai
Samuel Cahyawijaya
Ziwei Ji
Pascale Fung
MLLM
14
46
0
09 Oct 2023
Tackling VQA with Pretrained Foundation Models without Further Training
Alvin De Jun Tan
Bingquan Shen
MLLM
10
1
0
27 Sep 2023
CARTIER: Cartographic lAnguage Reasoning Targeted at Instruction Execution for Robots
D. Rivkin
Nikhil Kakodkar
F. Hogan
Bobak H. Baghi
Gregory Dudek
LM&Ro
11
3
0
21 Jul 2023
Unified Language Representation for Question Answering over Text, Tables, and Images
Yu Bowen
Cheng Fu
Haiyang Yu
Fei Huang
Yongbin Li
LMTD
14
10
0
29 Jun 2023
Pre-Training Multi-Modal Dense Retrievers for Outside-Knowledge Visual Question Answering
Alireza Salemi
Mahta Rafiee
Hamed Zamani
16
8
0
28 Jun 2023
Encyclopedic VQA: Visual questions about detailed properties of fine-grained categories
Thomas Mensink
J. Uijlings
Lluis Castrejon
A. Goel
Felipe Cadar
Howard Zhou
Fei Sha
A. Araújo
V. Ferrari
20
36
0
15 Jun 2023
Q: How to Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images!
Zaid Khan
B. Vijaykumar
S. Schulter
Xiang Yu
Y. Fu
Manmohan Chandraker
VLM
MLLM
16
17
0
06 Jun 2023
PaLI-X: On Scaling up a Multilingual Vision and Language Model
Xi Chen
Josip Djolonga
Piotr Padlewski
Basil Mustafa
Soravit Changpinyo
...
Mojtaba Seyedhosseini
A. Angelova
Xiaohua Zhai
N. Houlsby
Radu Soricut
VLM
33
186
0
29 May 2023
If at First You Don't Succeed, Try, Try Again: Faithful Diffusion-based Text-to-Image Generation by Selection
Shyamgopal Karthik
Karsten Roth
Massimiliano Mancini
Zeynep Akata
24
20
0
22 May 2023
Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner
Zikang Liu
Sihan Chen
Longteng Guo
Handong Li
Xingjian He
J. Liu
6
1
0
19 May 2023
What You See is What You Read? Improving Text-Image Alignment Evaluation
Michal Yarom
Yonatan Bitton
Soravit Changpinyo
Roee Aharoni
Jonathan Herzig
Oran Lang
E. Ofek
Idan Szpektor
EGVM
31
72
0
17 May 2023
TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering
Yushi Hu
Benlin Liu
Jungo Kasai
Yizhong Wang
Mari Ostendorf
Ranjay Krishna
Noah A. Smith
EGVM
16
116
0
21 Mar 2023
Architext: Language-Driven Generative Architecture Design
Theodoros Galanos
Antonios Liapis
Georgios N. Yannakakis
VLM
AI4CE
23
6
0
13 Mar 2023
Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images
Nitzan Bitton-Guetta
Yonatan Bitton
Jack Hessel
Ludwig Schmidt
Yuval Elovici
Gabriel Stanovsky
Roy Schwartz
VLM
121
65
0
13 Mar 2023
PaLM-E: An Embodied Multimodal Language Model
Danny Driess
F. Xia
Mehdi S. M. Sajjadi
Corey Lynch
Aakanksha Chowdhery
...
Marc Toussaint
Klaus Greff
Andy Zeng
Igor Mordatch
Peter R. Florence
LM&Ro
18
1,539
0
06 Mar 2023
EVJVQA Challenge: Multilingual Visual Question Answering
N. Nguyen
Nghia Hieu Nguyen
Duong T.D. Vo
K. Tran
Kiet Van Nguyen
17
7
0
23 Feb 2023
Connecting Vision and Language with Video Localized Narratives
P. Voigtlaender
Soravit Changpinyo
Jordi Pont-Tuset
Radu Soricut
V. Ferrari
VGen
23
21
0
22 Feb 2023
Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities
Hexiang Hu
Yi Luan
Yang Chen
Urvashi Khandelwal
Mandar Joshi
Kenton Lee
Kristina Toutanova
Ming-Wei Chang
VLM
43
54
0
22 Feb 2023
MAQA: A Multimodal QA Benchmark for Negation
Judith Yue Li
Aren Jansen
Qingqing Huang
Joonseok Lee
Ravi Ganti
Dima Kuzmin
14
5
0
09 Jan 2023
From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models
Jiaxian Guo
Junnan Li
Dongxu Li
A. M. H. Tiong
Boyang Albert Li
Dacheng Tao
Steven C. H. Hoi
VLM
MLLM
16
106
0
21 Dec 2022
Modal-specific Pseudo Query Generation for Video Corpus Moment Retrieval
Minjoon Jung
Seongho Choi
Joo-Kyung Kim
Jin-Hwa Kim
Byoung-Tak Zhang
29
7
0
23 Oct 2022
Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training
A. M. H. Tiong
Junnan Li
Boyang Albert Li
Silvio Savarese
S. Hoi
MLLM
13
101
0
17 Oct 2022
SQA3D: Situated Question Answering in 3D Scenes
Xiaojian Ma
Silong Yong
Zilong Zheng
Qing Li
Yitao Liang
Song-Chun Zhu
Siyuan Huang
LM&Ro
10
129
0
14 Oct 2022
PaLI: A Jointly-Scaled Multilingual Language-Image Model
Xi Chen
Xiao Wang
Soravit Changpinyo
A. Piergiovanni
Piotr Padlewski
...
Andreas Steiner
A. Angelova
Xiaohua Zhai
N. Houlsby
Radu Soricut
MLLM
VLM
13
529
0
14 Sep 2022
MaXM: Towards Multilingual Visual Question Answering
Soravit Changpinyo
Linting Xue
Michal Yarom
Ashish V. Thapliyal
Idan Szpektor
J. Amelot
Xi Chen
Radu Soricut
23
8
0
12 Sep 2022
PACTran: PAC-Bayesian Metrics for Estimating the Transferability of Pretrained Models to Classification Tasks
Nan Ding
Xi Chen
Tomer Levinboim
Soravit Changpinyo
Radu Soricut
14
26
0
10 Mar 2022