1505.04870
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models
19 May 2015
Bryan A. Plummer
Liwei Wang
Christopher M. Cervantes
Juan C. Caicedo
Anjali Narayan-Chen
Svetlana Lazebnik
Papers citing "Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models"
50 / 1,325 papers shown
A Survey on Visual Transfer Learning using Knowledge Graphs
Sebastian Monka
Lavdim Halilaj
Achim Rettinger
254
27
0
27 Jan 2022
PARS: Pseudo-Label Aware Robust Sample Selection for Learning with Noisy Labels
A. Goel
Yunlong Jiao
Jordan Massiah
NoLa
170
10
0
26 Jan 2022
Supervised Visual Attention for Simultaneous Multimodal Machine Translation
Journal of Artificial Intelligence Research (JAIR), 2022
Veneta Haralampieva
Ozan Caglayan
Lucia Specia
LRM
223
4
0
23 Jan 2022
Unpaired Referring Expression Grounding via Bidirectional Cross-Modal Matching
Neurocomputing, 2022
Hengcan Shi
Munawar Hayat
Jianfei Cai
ObjD
209
12
0
18 Jan 2022
Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training
Yehao Li
Jiahao Fan
Yingwei Pan
Ting Yao
Weiyao Lin
Tao Mei
MLLM
ObjD
222
24
0
11 Jan 2022
Semantically Grounded Visual Embeddings for Zero-Shot Learning
Shah Nawaz
Jacopo Cavazza
Alessio Del Bue
ObjD
FedML
VLM
285
6
0
03 Jan 2022
Deconfounded Visual Grounding
AAAI Conference on Artificial Intelligence (AAAI), 2021
Jianqiang Huang
Yu Qin
Jiaxin Qi
Qianru Sun
Hanwang Zhang
CML
ObjD
203
38
0
31 Dec 2021
Grounding Linguistic Commands to Navigable Regions
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2021
N. Rufus
Kanishk Jain
U. R. Nair
Vineet Gandhi
K. M. Krishna
LM&Ro
214
13
0
24 Dec 2021
Scaling Open-Vocabulary Image Segmentation with Image-Level Labels
European Conference on Computer Vision (ECCV), 2021
Golnaz Ghiasi
Xiuye Gu
Yin Cui
Tsung-Yi Lin
VLM
444
497
0
22 Dec 2021
A Survey of Natural Language Generation
ACM Computing Surveys (CSUR), 2021
Chenhe Dong
Hai-Tao Zheng
Haifan Gong
Mengzhao Chen
Junxin Li
Ying Shen
Min Yang
3DV
336
65
0
22 Dec 2021
ScanQA: 3D Question Answering for Spatial Scene Understanding
Computer Vision and Pattern Recognition (CVPR), 2021
Daichi Azuma
Taiki Miyanishi
Shuhei Kurita
M. Kawanabe
444
328
0
20 Dec 2021
Bottom Up Top Down Detection Transformers for Language Grounding in Images and Point Clouds
Ayush Jain
N. Gkanatsios
Ishita Mediratta
Katerina Fragkiadaki
ObjD
492
148
0
16 Dec 2021
Distilled Dual-Encoder Model for Vision-Language Understanding
Zekun Wang
Wenhui Wang
Haichao Zhu
Ming Liu
Bing Qin
Furu Wei
VLM
FedML
214
35
0
16 Dec 2021
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena
Letitia Parcalabescu
Michele Cafagna
Lilitta Muradjan
Anette Frank
Iacer Calixto
Albert Gatt
CoGe
303
137
0
14 Dec 2021
Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation
Tianyi Liu
Zuxuan Wu
Wenhan Xiong
Yue Yu
Yu-Gang Jiang
VLM
MLLM
221
11
0
10 Dec 2021
FLAVA: A Foundational Language And Vision Alignment Model
Amanpreet Singh
Ronghang Hu
Vedanuj Goswami
Guillaume Couairon
Wojciech Galuba
Marcus Rohrbach
Douwe Kiela
CLIP
VLM
383
873
0
08 Dec 2021
MLP Architectures for Vision-and-Language Modeling: An Empirical Study
Yixin Nie
Linjie Li
Zhe Gan
Shuohang Wang
Chenguang Zhu
Michael Zeng
Zicheng Liu
Joey Tianyi Zhou
Lijuan Wang
167
9
0
08 Dec 2021
Grounded Language-Image Pre-training
Liunian Harold Li
Pengchuan Zhang
Haotian Zhang
Jianwei Yang
Chunyuan Li
...
Lu Yuan
Lei Zhang
Lei Li
Kai-Wei Chang
Jianfeng Gao
ObjD
VLM
468
1,407
0
07 Dec 2021
From Coarse to Fine-grained Concept based Discrimination for Phrase Detection
Maan Qraitem
Bryan A. Plummer
ObjD
200
0
0
06 Dec 2021
D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding
Dave Zhenyu Chen
Qirui Wu
Matthias Nießner
Angel X. Chang
196
51
0
02 Dec 2021
Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks
Xizhou Zhu
Jinguo Zhu
Hao Li
Xiaoshi Wu
Xiaogang Wang
Jiaming Song
Xiaohua Wang
Jifeng Dai
252
152
0
02 Dec 2021
Weakly-Supervised Video Object Grounding via Causal Intervention
Wei Wang
Junyu Gao
Changsheng Xu
CML
325
32
0
01 Dec 2021
UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling
Zhengyuan Yang
Zhe Gan
Jianfeng Wang
Xiaowei Hu
Faisal Ahmed
Zicheng Liu
Yumao Lu
Lijuan Wang
357
134
0
23 Nov 2021
Florence: A New Foundation Model for Computer Vision
Lu Yuan
Dongdong Chen
Yi-Ling Chen
Noel Codella
Xiyang Dai
...
Zhen Xiao
Jianwei Yang
Michael Zeng
Luowei Zhou
Pengchuan Zhang
VLM
409
1,060
0
22 Nov 2021
Class-agnostic Object Detection with Multi-modal Transformer
European Conference on Computer Vision (ECCV), 2021
Muhammad Maaz
H. Rasheed
Salman Khan
Fahad Shahbaz Khan
Rao Muhammad Anwer
Ming-Hsuan Yang
625
117
0
22 Nov 2021
Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts
International Conference on Machine Learning (ICML), 2021
Yan Zeng
Xinsong Zhang
Hang Li
VLM
CLIP
345
356
0
16 Nov 2021
Memotion Analysis through the Lens of Joint Embedding
AAAI Conference on Artificial Intelligence (AAAI), 2021
Nethra Gunti
Sathyanarayanan Ramamoorthy
Parth Patwa
Amitava Das
130
8
0
13 Nov 2021
FILIP: Fine-grained Interactive Language-Image Pre-Training
International Conference on Learning Representations (ICLR), 2021
Lewei Yao
Runhui Huang
Lu Hou
Guansong Lu
Minzhe Niu
Hang Xu
Xiaodan Liang
Zhenguo Li
Xin Jiang
Chunjing Xu
VLM
CLIP
343
769
0
09 Nov 2021
Negative Sample is Negative in Its Own Way: Tailoring Negative Sentences for Image-Text Retrieval
Zhihao Fan
Zhongyu Wei
Zejun Li
Siyuan Wang
Jianqing Fan
186
7
0
05 Nov 2021
An Empirical Study of Training End-to-End Vision-and-Language Transformers
Computer Vision and Pattern Recognition (CVPR), 2021
Zi-Yi Dou
Yichong Xu
Zhe Gan
Jianfeng Wang
Shuohang Wang
...
Pengchuan Zhang
Lu Yuan
Nanyun Peng
Zicheng Liu
Michael Zeng
VLM
302
438
0
03 Nov 2021
VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts
Neural Information Processing Systems (NeurIPS), 2021
Hangbo Bao
Wenhui Wang
Li Dong
Qiang Liu
Owais Khan Mohammed
Kriti Aggarwal
Subhojit Som
Furu Wei
VLM
MLLM
MoE
981
693
0
03 Nov 2021
Bangla Image Caption Generation through CNN-Transformer based Encoder-Decoder Network
Md Aminul Haque Palash
MD Abdullah Al Nasim
Sourav Saha
Faria Afrin
Raisa Mallik
Sathishkumar Samiappan
ViT
152
16
0
24 Oct 2021
Text-Based Person Search with Limited Data
British Machine Vision Conference (BMVC), 2021
Xiaoping Han
Sen He
Li Zhang
Tao Xiang
210
125
0
20 Oct 2021
Look at What I'm Doing: Self-Supervised Spatial Grounding of Narrations in Instructional Videos
Neural Information Processing Systems (NeurIPS), 2021
Reuben Tan
Bryan A. Plummer
Kate Saenko
Hailin Jin
Bryan C. Russell
SSL
213
28
0
20 Oct 2021
VLDeformer: Vision-Language Decomposed Transformer for Fast Cross-Modal Retrieval
Knowledge-Based Systems (KBS), 2021
Lisai Zhang
Hongfa Wu
Qingcai Chen
Yimeng Deng
Zhonghua Li
Dejiang Kong
Bo Zhao
Joanna Siebert
Yunpeng Han
ViT
VLM
220
24
0
20 Oct 2021
Towards Language-guided Visual Recognition via Dynamic Convolutions
Gen Luo
Weihao Ye
Xiaoshuai Sun
Yongjian Wu
Yue Gao
Rongrong Ji
ObjD
242
27
0
17 Oct 2021
Unsupervised Natural Language Inference Using PHL Triplet Generation
Neeraj Varshney
Pratyay Banerjee
Tejas Gokhale
Chitta Baral
261
10
0
16 Oct 2021
Is An Image Worth Five Sentences? A New Look into Semantics for Image-Text Matching
Ali Furkan Biten
Andrés Mafla
Lluís Gómez
Dimosthenis Karatzas
459
20
0
06 Oct 2021
Learning Structural Representations for Recipe Generation and Food Retrieval
Hao Wang
Guosheng Lin
Steven C. H. Hoi
Chunyan Miao
157
38
0
04 Oct 2021
CIDEr-R: Robust Consensus-based Image Description Evaluation
G. O. D. Santos
Esther Luna Colombini
Sandra Avila
161
42
0
28 Sep 2021
CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models
Yuan Yao
Ao Zhang
Zhengyan Zhang
Zhiyuan Liu
Tat-Seng Chua
Maosong Sun
MLLM
VPVLM
VLM
594
245
0
24 Sep 2021
Discovering and Validating AI Errors With Crowdsourced Failure Reports
Ángel Alexander Cabrera
Abraham J. Druck
Jason I. Hong
Adam Perer
HAI
181
63
0
23 Sep 2021
KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation
Yongfei Liu
Chenfei Wu
Shao-Yen Tseng
Vasudev Lal
Xuming He
Nan Duan
CLIP
VLM
283
32
0
22 Sep 2021
Associative Memories via Predictive Coding
Tommaso Salvatori
Yuhang Song
Yujian Hong
Simon Frieder
Lei Sha
Zhenghua Xu
Rafal Bogacz
Thomas Lukasiewicz
196
78
0
16 Sep 2021
Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning
Da Yin
Liunian Harold Li
Ziniu Hu
Nanyun Peng
Kai-Wei Chang
300
65
0
14 Sep 2021
xGQA: Cross-Lingual Visual Question Answering
Jonas Pfeiffer
Gregor Geigle
Aishwarya Kamath
Jan-Martin O. Steitz
Stefan Roth
Ivan Vulić
Iryna Gurevych
362
80
0
13 Sep 2021
DSSL: Deep Surroundings-person Separation Learning for Text-based Person Retrieval
A. Zhu
Zijie Wang
Yifeng Li
Xili Wan
Jing Jin
Tian Wang
Fangqiang Hu
G. Hua
267
253
0
12 Sep 2021
Constructing Phrase-level Semantic Labels to Form Multi-Grained Supervision for Image-Text Retrieval
Zhihao Fan
Zhongyu Wei
Zejun Li
Siyuan Wang
Haijun Shan
Xuanjing Huang
Jianqing Fan
CLIP
118
12
0
12 Sep 2021
Panoptic Narrative Grounding
IEEE International Conference on Computer Vision (ICCV), 2021
Cristina González
Nicolás Ayobi
Isabela Hernández
José Hernández
Jordi Pont-Tuset
Pablo Arbeláez
258
28
0
10 Sep 2021
Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021
Stella Frank
Emanuele Bugliarello
Desmond Elliott
190
95
0
09 Sep 2021