Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
1505.00468
Cited By
VQA: Visual Question Answering
3 May 2015
Aishwarya Agrawal
Jiasen Lu
Stanislaw Antol
Margaret Mitchell
C. L. Zitnick
Dhruv Batra
Devi Parikh
CoGe
Re-assign community
ArXiv
PDF
HTML
Papers citing
"VQA: Visual Question Answering"
50 / 792 papers shown
Title
Contrastive Vision-Language Pre-training with Limited Resources
Quan Cui
Boyan Zhou
Yu Guo
Weidong Yin
Hao Wu
Osamu Yoshie
Yubo Chen
VLM
CLIP
16
32
0
17 Dec 2021
TransZero++: Cross Attribute-Guided Transformer for Zero-Shot Learning
Shiming Chen
Zi-Quan Hong
Wenjin Hou
Guosen Xie
Yibing Song
Jian-jun Zhao
Xinge You
Shuicheng Yan
Ling Shao
ViT
17
44
0
16 Dec 2021
SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning
Zhecan Wang
Haoxuan You
Liunian Harold Li
Alireza Zareian
Suji Park
Yiqing Liang
Kai-Wei Chang
Shih-Fu Chang
ReLM
LRM
15
30
0
16 Dec 2021
3D Question Answering
Shuquan Ye
Dongdong Chen
Songfang Han
Jing Liao
ViT
24
46
0
15 Dec 2021
Bilateral Cross-Modality Graph Matching Attention for Feature Fusion in Visual Question Answering
Jianjian Cao
Xiameng Qin
Sanyuan Zhao
Jianbing Shen
25
20
0
14 Dec 2021
CLIP-Lite: Information Efficient Visual Representation Learning with Language Supervision
A. Shrivastava
Ramprasaath R. Selvaraju
Nikhil Naik
Vicente Ordonez
VLM
CLIP
22
6
0
14 Dec 2021
ITA: Image-Text Alignments for Multi-Modal Named Entity Recognition
Xinyu Wang
Min Gui
Yong-jia Jiang
Zixia Jia
Nguyen Bach
Tao Wang
Zhongqiang Huang
Fei Huang
Kewei Tu
31
52
0
13 Dec 2021
PTR: A Benchmark for Part-based Conceptual, Relational, and Physical Reasoning
Yining Hong
Li Yi
J. Tenenbaum
Antonio Torralba
Chuang Gan
9
39
0
09 Dec 2021
Question Answering Survey: Directions, Challenges, Datasets, Evaluation Matrices
Hariom A. Pandya
Brijesh S. Bhatt
38
27
0
07 Dec 2021
MoCA: Incorporating Multi-stage Domain Pretraining and Cross-guided Multimodal Attention for Textbook Question Answering
Fangzhi Xu
Qika Lin
J. Liu
Lingling Zhang
Tianzhe Zhao
Qianyi Chai
Yudai Pan
9
2
0
06 Dec 2021
DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting
Yongming Rao
Wenliang Zhao
Guangyi Chen
Yansong Tang
Zheng Zhu
Guan Huang
Jie Zhou
Jiwen Lu
VLM
CLIP
94
551
0
02 Dec 2021
Consensus Graph Representation Learning for Better Grounded Image Captioning
Wenqiao Zhang
Haochen Shi
Siliang Tang
Jun Xiao
Qiang Yu
Yueting Zhuang
15
53
0
02 Dec 2021
Iconary: A Pictionary-Based Game for Testing Multimodal Communication with Drawings and Text
Christopher Clark
Jordi Salvador
Dustin Schwenk
Derrick Bonafilia
Mark Yatskar
...
Aaron Sarnat
Hannaneh Hajishirzi
Aniruddha Kembhavi
Oren Etzioni
Ali Farhadi
MLLM
20
3
0
01 Dec 2021
Classification-Regression for Chart Comprehension
Matan Levy
Rami Ben-Ari
Dani Lischinski
23
15
0
29 Nov 2021
LAFITE: Towards Language-Free Training for Text-to-Image Generation
Yufan Zhou
Ruiyi Zhang
Changyou Chen
Chunyuan Li
Chris Tensmeyer
Tong Yu
Jiuxiang Gu
Jinhui Xu
Tong Sun
VLM
21
162
0
27 Nov 2021
Not All Relations are Equal: Mining Informative Labels for Scene Graph Generation
A. Goel
Basura Fernando
Frank Keller
Hakan Bilen
25
30
0
26 Nov 2021
Confounder Identification-free Causal Visual Feature Learning
Xin Li
Zhizheng Zhang
Guoqiang Wei
Cuiling Lan
Wenjun Zeng
Xin Jin
Zhibo Chen
CML
OOD
16
14
0
26 Nov 2021
UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling
Zhengyuan Yang
Zhe Gan
Jianfeng Wang
Xiaowei Hu
Faisal Ahmed
Zicheng Liu
Yumao Lu
Lijuan Wang
19
111
0
23 Nov 2021
DyTox: Transformers for Continual Learning with DYnamic TOken eXpansion
Arthur Douillard
Alexandre Ramé
Guillaume Couairon
Matthieu Cord
CLL
30
295
0
22 Nov 2021
EMScore: Evaluating Video Captioning via Coarse-Grained and Fine-Grained Embedding Matching
Yaya Shi
Xu Yang
Haiyang Xu
Chunfen Yuan
Bing Li
Weiming Hu
Zhengjun Zha
36
33
0
17 Nov 2021
Achieving Human Parity on Visual Question Answering
Ming Yan
Haiyang Xu
Chenliang Li
Junfeng Tian
Bin Bi
...
Ji Zhang
Songfang Huang
Fei Huang
Luo Si
Rong Jin
24
12
0
17 Nov 2021
Language bias in Visual Question Answering: A Survey and Taxonomy
Desen Yuan
22
12
0
16 Nov 2021
Trustworthy Multimodal Regression with Mixture of Normal-inverse Gamma Distributions
Huan Ma
Zongbo Han
Changqing Zhang
H. Fu
Joey Tianyi Zhou
Q. Hu
EDL
UQCV
68
42
0
11 Nov 2021
A Survey of Visual Transformers
Yang Liu
Yao Zhang
Yixin Wang
Feng Hou
Jin Yuan
Jiang Tian
Yang Zhang
Zhongchao Shi
Jianping Fan
Zhiqiang He
3DGS
ViT
71
330
0
11 Nov 2021
ICDAR 2021 Competition on Document VisualQuestion Answering
Rubèn Pérez Tito
Minesh Mathew
C. V. Jawahar
Ernest Valveny
Dimosthenis Karatzas
32
23
0
10 Nov 2021
Visual Question Answering based on Formal Logic
Muralikrishnna G. Sethuraman
Ali Payani
Faramarz Fekri
J. C. Kerce
NAI
16
3
0
08 Nov 2021
Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling
Renrui Zhang
Rongyao Fang
Wei Zhang
Peng Gao
Kunchang Li
Jifeng Dai
Yu Qiao
Hongsheng Li
VLM
189
385
0
06 Nov 2021
Dynamic Visual Reasoning by Learning Differentiable Physics Models from Video and Language
Mingyu Ding
Zhenfang Chen
Tao Du
Ping Luo
J. Tenenbaum
Chuang Gan
VGen
PINN
OCL
27
74
0
28 Oct 2021
IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning
Pan Lu
Liang Qiu
Jiaqi Chen
Tony Xia
Yizhou Zhao
Wei Zhang
Zhou Yu
Xiaodan Liang
Song-Chun Zhu
AIMat
28
183
0
25 Oct 2021
Instance-Conditional Knowledge Distillation for Object Detection
Zijian Kang
Peizhen Zhang
X. Zhang
Jian-jun Sun
N. Zheng
14
76
0
25 Oct 2021
Unifying Multimodal Transformer for Bi-directional Image and Text Generation
Yupan Huang
Hongwei Xue
Bei Liu
Yutong Lu
11
57
0
19 Oct 2021
Ego4D: Around the World in 3,000 Hours of Egocentric Video
Kristen Grauman
Andrew Westbury
Eugene Byrne
Zachary Chavis
Antonino Furnari
...
Mike Zheng Shou
Antonio Torralba
Lorenzo Torresani
Mingfei Yan
Jitendra Malik
EgoV
224
1,018
0
13 Oct 2021
Topic Scene Graph Generation by Attention Distillation from Caption
Wenbin Wang
R. Wang
X. Chen
DiffM
17
14
0
12 Oct 2021
Pano-AVQA: Grounded Audio-Visual Question Answering on 360
∘
^\circ
∘
Videos
Heeseung Yun
Youngjae Yu
Wonsuk Yang
Kangil Lee
Gunhee Kim
25
78
0
11 Oct 2021
Calling to CNN-LSTM for Rumor Detection: A Deep Multi-channel Model for Message Veracity Classification in Microblogs
Abderrazek Azri
Cécile Favre
Nouria Harbi
Jérôme Darmont
C. Noûs
17
11
0
11 Oct 2021
Asking questions on handwritten document collections
Minesh Mathew
Lluís Gómez
Dimosthenis Karatzas
C. V. Jawahar
RALM
20
11
0
02 Oct 2021
Multimodal Integration of Human-Like Attention in Visual Question Answering
Ekta Sood
Fabian Kögel
Philippe Muller
Dominike Thomas
Mihai Bâce
Andreas Bulling
33
16
0
27 Sep 2021
VQA-MHUG: A Gaze Dataset to Study Multimodal Neural Attention in Visual Question Answering
Ekta Sood
Fabian Kögel
Florian Strohm
Prajit Dhar
Andreas Bulling
29
19
0
27 Sep 2021
An animated picture says at least a thousand words: Selecting Gif-based Replies in Multimodal Dialog
Xingyao Wang
David Jurgens
20
5
0
24 Sep 2021
How to find a good image-text embedding for remote sensing visual question answering?
Christel Chappuis
Sylvain Lobry
B. Kellenberger
Bertrand Le Saux
D. Tuia
32
20
0
24 Sep 2021
CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models
Yuan Yao
Ao Zhang
Zhengyan Zhang
Zhiyuan Liu
Tat-Seng Chua
Maosong Sun
MLLM
VPVLM
VLM
202
221
0
24 Sep 2021
Caption Enriched Samples for Improving Hateful Memes Detection
Efrat Blaier
Itzik Malkiel
Lior Wolf
VLM
51
21
0
22 Sep 2021
A Survey on Temporal Sentence Grounding in Videos
Xiaohan Lan
Yitian Yuan
Xin Eric Wang
Zhi Wang
Wenwu Zhu
27
47
0
16 Sep 2021
Knowledge-based Embodied Question Answering
Sinan Tan
Mengmeng Ge
Di Guo
Huaping Liu
F. Sun
22
20
0
16 Sep 2021
Image Captioning for Effective Use of Language Models in Knowledge-Based Visual Question Answering
Ander Salaberria
Gorka Azkune
Oier López de Lacalle
Aitor Soroa Etxabe
Eneko Agirre
27
59
0
15 Sep 2021
Discovering the Unknown Knowns: Turning Implicit Knowledge in the Dataset into Explicit Training Examples for Visual Question Answering
Jihyung Kil
Cheng Zhang
D. Xuan
Wei-Lun Chao
56
20
0
13 Sep 2021
Panoptic Narrative Grounding
Cristina González
Nicolás Ayobi
Isabela Hernández
José Hernández
Jordi Pont-Tuset
Pablo Arbeláez
76
22
0
10 Sep 2021
TxT: Crossmodal End-to-End Learning with Transformers
Jan-Martin O. Steitz
Jonas Pfeiffer
Iryna Gurevych
Stefan Roth
LRM
16
2
0
09 Sep 2021
M5Product: Self-harmonized Contrastive Learning for E-commercial Multi-modal Pretraining
Xiao Dong
Xunlin Zhan
Yangxin Wu
Yunchao Wei
Michael C. Kampffmeyer
Xiaoyong Wei
Minlong Lu
Yaowei Wang
Xiaodan Liang
25
36
0
09 Sep 2021
Vision Guided Generative Pre-trained Language Models for Multimodal Abstractive Summarization
Tiezheng Yu
Wenliang Dai
Zihan Liu
Pascale Fung
24
73
0
06 Sep 2021
Previous
1
2
3
...
8
9
10
...
14
15
16
Next