1505.04870
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models
19 May 2015
Bryan A. Plummer
Liwei Wang
Christopher M. Cervantes
Juan C. Caicedo
Anjali Narayan-Chen
Svetlana Lazebnik
Papers citing
"Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models"
50 / 1,325 papers shown
The Case for Perspective in Multimodal Datasets
Marcelo Viridiano
Haiyue Song
Oliver Czulo
Arthur Lorenzi
E. Matos
Frederico Belcavello
118
7
0
22 May 2022
Training Vision-Language Transformers from Captions
Liangke Gui
Yingshan Chang
Qiuyuan Huang
Subhojit Som
Alexander G. Hauptmann
Jianfeng Gao
Yonatan Bisk
VLM
ViT
425
11
0
19 May 2022
Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning
Computer Vision and Pattern Recognition (CVPR), 2022
Chia-Wen Kuo
Z. Kira
266
81
0
09 May 2022
RoViST: Learning Robust Metrics for Visual Storytelling
Eileen Wang
S. Han
Josiah Poon
166
13
0
08 May 2022
Language Models Can See: Plugging Visual Controls in Text Generation
Yixuan Su
Tian Lan
Yahui Liu
Fangyu Liu
Dani Yogatama
Yan Wang
Lingpeng Kong
Nigel Collier
VLM
MLLM
274
111
0
05 May 2022
CoCa: Contrastive Captioners are Image-Text Foundation Models
Jiahui Yu
Zirui Wang
Vijay Vasudevan
Legg Yeung
Mojtaba Seyedhosseini
Yonghui Wu
VLM
CLIP
OffRL
708
1,616
0
04 May 2022
Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering
A. Piergiovanni
Wei Li
Weicheng Kuo
M. Saffar
Fred Bertsch
A. Angelova
281
18
0
02 May 2022
Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning
Computer Vision and Pattern Recognition (CVPR), 2022
Li Yang
Yan Xu
Chunfen Yuan
Wei Liu
Bing Li
Weiming Hu
ObjD
293
155
0
30 Apr 2022
Leaner and Faster: Two-Stage Model Compression for Lightweight Text-Image Retrieval
North American Chapter of the Association for Computational Linguistics (NAACL), 2022
Siyu Ren
Kenny Q. Zhu
VLM
102
9
0
29 Apr 2022
Self-paced Multi-grained Cross-modal Interaction Modeling for Referring Expression Comprehension
IEEE Transactions on Image Processing (IEEE TIP), 2022
Peihan Miao
Wei Su
Gaoang Wang
Xuewei Li
Xi Li
ObjD
339
13
0
21 Apr 2022
Making the Most of Text Semantics to Improve Biomedical Vision-Language Processing
European Conference on Computer Vision (ECCV), 2022
Benedikt Boecking
Naoto Usuyama
Shruthi Bannur
Daniel Coelho De Castro
Anton Schwaighofer
...
Tristan Naumann
A. Nori
Javier Alvarez-Valle
Hoifung Poon
Ozan Oktay
496
368
0
21 Apr 2022
Transformer Decoders with MultiModal Regularization for Cross-Modal Food Retrieval
Mustafa Shukor
Guillaume Couairon
Asya Grechka
Matthieu Cord
ViT
204
23
0
20 Apr 2022
ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models
Neural Information Processing Systems (NeurIPS), 2022
Chunyuan Li
Haotian Liu
Liunian Harold Li
Pengchuan Zhang
J. Aneja
...
Ping Jin
Houdong Hu
Zicheng Liu
Yong Jae Lee
Jianfeng Gao
297
177
0
19 Apr 2022
COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval
Computer Vision and Pattern Recognition (CVPR), 2022
Haoyu Lu
Nanyi Fei
Yuqi Huo
Yizhao Gao
Zhiwu Lu
Jiaxin Wen
CLIP
VLM
262
55
0
15 Apr 2022
Brainish: Formalizing A Multimodal Language for Intelligence and Consciousness
Paul Pu Liang
359
7
0
14 Apr 2022
X-DETR: A Versatile Architecture for Instance-wise Vision-Language Tasks
European Conference on Computer Vision (ECCV), 2022
Zhaowei Cai
Gukyeong Kwon
Avinash Ravichandran
Erhan Bas
Zhuowen Tu
Rahul Bhotika
Stefano Soatto
ObjD
MLLM
VLM
162
52
0
12 Apr 2022
Adapting CLIP For Phrase Localization Without Further Training
Jiahao Li
G. Shakhnarovich
Raymond A. Yeh
VLM
CLIP
222
26
0
07 Apr 2022
ECCV Caption: Correcting False Negatives by Collecting Machine-and-Human-verified Image-Caption Associations for MS-COCO
European Conference on Computer Vision (ECCV), 2022
Sanghyuk Chun
Wonjae Kim
Song Park
Minsuk Chang
Seong Joon Oh
VLM
1.5K
51
0
07 Apr 2022
Multi-View Transformer for 3D Visual Grounding
Computer Vision and Pattern Recognition (CVPR), 2022
Shijia Huang
Yilun Chen
Jiaya Jia
Liwei Wang
398
177
0
05 Apr 2022
Learning Commonsense-aware Moment-Text Alignment for Fast Video Temporal Grounding
Ziyue Wu
Junyu Gao
Shucheng Huang
Changsheng Xu
241
6
0
04 Apr 2022
FindIt: Generalized Localization with Natural Language Queries
European Conference on Computer Vision (ECCV), 2022
Weicheng Kuo
Fred Bertsch
Wei Li
A. Piergiovanni
M. Saffar
A. Angelova
ObjD
215
18
0
31 Mar 2022
TubeDETR: Spatio-Temporal Video Grounding with Transformers
Computer Vision and Pattern Recognition (CVPR), 2022
Antoine Yang
Antoine Miech
Josef Sivic
Ivan Laptev
Cordelia Schmid
ViT
341
121
0
30 Mar 2022
Shifting More Attention to Visual Backbone: Query-modulated Refinement Networks for End-to-End Visual Grounding
Computer Vision and Pattern Recognition (CVPR), 2022
Jiabo Ye
Junfeng Tian
Ming Yan
Xiaoshan Yang
Xuwu Wang
Ji Zhang
Liang He
Xin Lin
ObjD
234
93
0
29 Mar 2022
Large-scale Bilingual Language-Image Contrastive Learning
ByungSoo Ko
Geonmo Gu
VLM
285
17
0
28 Mar 2022
Single-Stream Multi-Level Alignment for Vision-Language Pretraining
European Conference on Computer Vision (ECCV), 2022
Zaid Khan
B. Vijaykumar
Xiang Yu
S. Schulter
Manmohan Chandraker
Y. Fu
CLIP
VLM
357
22
0
27 Mar 2022
Knowledge Mining with Scene Text for Fine-Grained Recognition
Computer Vision and Pattern Recognition (CVPR), 2022
Hao Wang
Junchao Liao
Tianheng Cheng
Zewen Gao
Hao Liu
Bo Ren
X. Bai
Wenyu Liu
208
14
0
27 Mar 2022
Look for the Change: Learning Object States and State-Modifying Actions from Untrimmed Web Videos
Computer Vision and Pattern Recognition (CVPR), 2022
Tomáš Souček
Jean-Baptiste Alayrac
Antoine Miech
Ivan Laptev
Josef Sivic
247
43
0
22 Mar 2022
WuDaoMM: A large-scale Multi-Modal Dataset for Pre-training models
Shan Yuan
Shuai Zhao
Jiahong Leng
Zhao Xue
Hanyu Zhao
Peiyu Liu
Zheng Gong
Wayne Xin Zhao
Junyi Li
Tang Jie
VLM
270
6
0
22 Mar 2022
Finding Structural Knowledge in Multimodal-BERT
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
Victor Milewski
Miryam de Lhoneux
Marie-Francine Moens
220
12
0
17 Mar 2022
Pseudo-Q: Generating Pseudo Language Queries for Visual Grounding
Computer Vision and Pattern Recognition (CVPR), 2022
Haojun Jiang
Yuanze Lin
Dongchen Han
Shiji Song
Gao Huang
ObjD
324
65
0
16 Mar 2022
NLX-GPT: A Model for Natural Language Explanations in Vision and Vision-Language Tasks
Computer Vision and Pattern Recognition (CVPR), 2022
Fawaz Sammani
Tanmoy Mukherjee
Nikos Deligiannis
MILM
ELM
LRM
321
75
0
09 Mar 2022
Geodesic Multi-Modal Mixup for Robust Fine-Tuning
Neural Information Processing Systems (NeurIPS), 2022
Changdae Oh
Junhyuk So
Hoyoon Byun
Yongtaek Lim
Minchul Shin
Jong-June Jeon
Kyungwoo Song
458
39
0
08 Mar 2022
Unpaired Image Captioning by Image-level Weakly-Supervised Visual Concept Recognition
IEEE transactions on multimedia (IEEE TMM), 2022
Peipei Zhu
Tianlin Li
Yong Luo
Zhenglong Sun
Wei-Shi Zheng
Yaowei Wang
Chen Chen
220
15
0
07 Mar 2022
FS-COCO: Towards Understanding of Freehand Sketches of Common Objects in Context
European Conference on Computer Vision (ECCV), 2022
Pinaki Nath Chowdhury
Aneeshan Sain
A. Bhunia
Tao Xiang
Yulia Gryaditskaya
Yi-Zhe Song
3DV
337
66
0
04 Mar 2022
Vision-Language Intelligence: Tasks, Representation Learning, and Large Models
Feng Li
Hao Zhang
Yi-Fan Zhang
Shixuan Liu
Jian Guo
L. Ni
Pengchuan Zhang
Lei Zhang
AI4TS
VLM
212
41
0
03 Mar 2022
Multi-modal Alignment using Representation Codebook
Computer Vision and Pattern Recognition (CVPR), 2022
Jiali Duan
Liqun Chen
Son Tran
Jinyu Yang
Yi Xu
Belinda Zeng
Trishul Chilimbi
511
79
0
28 Feb 2022
StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Translation
International Joint Conference on Artificial Intelligence (IJCAI), 2022
Peter Schaldenbrand
Zhixuan Liu
Jean Oh
CLIP
194
50
0
24 Feb 2022
GroupViT: Semantic Segmentation Emerges from Text Supervision
Computer Vision and Pattern Recognition (CVPR), 2022
Jiarui Xu
Shalini De Mello
Sifei Liu
Wonmin Byeon
Thomas Breuel
Jan Kautz
Xinyu Wang
ViT
VLM
765
637
0
22 Feb 2022
Vision-Language Pre-Training with Triple Contrastive Learning
Computer Vision and Pattern Recognition (CVPR), 2022
Jinyu Yang
Jiali Duan
Son N. Tran
Yi Xu
Sampath Chanda
Liqun Chen
Belinda Zeng
Trishul Chilimbi
Junzhou Huang
VLM
575
358
0
21 Feb 2022
On Guiding Visual Attention with Language Specification
Computer Vision and Pattern Recognition (CVPR), 2022
Suzanne Petryk
Lisa Dunlap
Keyan Nasseri
Joseph E. Gonzalez
Trevor Darrell
Anna Rohrbach
VLM
415
39
1
17 Feb 2022
CommerceMM: Large-Scale Commerce MultiModal Representation Learning with Omni Retrieval
Knowledge Discovery and Data Mining (KDD), 2022
Licheng Yu
Jun Chen
Animesh Sinha
Mengjiao MJ Wang
Hugo Chen
Tamara L. Berg
Ning Zhang
VLM
263
44
0
15 Feb 2022
Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark
Neural Information Processing Systems (NeurIPS), 2022
Jiaxi Gu
Xiaojun Meng
Guansong Lu
Lu Hou
Minzhe Niu
...
Runhu Huang
Wei Zhang
Xingda Jiang
Chunjing Xu
Hang Xu
VLM
410
138
0
14 Feb 2022
I-Tuning: Tuning Frozen Language Models with Image for Lightweight Image Captioning
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2022
Ziyang Luo
Zhipeng Hu
Yadong Xi
Rongsheng Zhang
Jing Ma
VLM
187
29
0
14 Feb 2022
Multi-Modal Knowledge Graph Construction and Application: A Survey
IEEE Transactions on Knowledge and Data Engineering (TKDE), 2022
Xiangru Zhu
Zhixu Li
Xiaodan Wang
Xueyao Jiang
Yixiang Chen
Xuwu Wang
Yanghua Xiao
N. Yuan
211
238
0
11 Feb 2022
Bench-Marking And Improving Arabic Automatic Image Captioning Through The Use Of Multi-Task Learning Paradigm
Muhy Eddin Za'ter
Bashar Talafha
VLM
199
6
0
11 Feb 2022
Keyword localisation in untranscribed speech using visually grounded speech models
IEEE Journal on Selected Topics in Signal Processing (IEEE JSTSP), 2022
Kayode Olaleye
Dan Oneaţă
Herman Kamper
206
7
0
02 Feb 2022
Deep Learning Approaches on Image Captioning: A Review
ACM Computing Surveys (ACM CSUR), 2022
Taraneh Ghandi
H. Pourreza
H. Mahyar
VLM
487
155
0
31 Jan 2022
A Frustratingly Simple Approach for End-to-End Image Captioning
Ziyang Luo
Yadong Xi
Rongsheng Zhang
Jing Ma
VLM
MLLM
244
19
0
30 Jan 2022
MVPTR: Multi-Level Semantic Alignment for Vision-Language Pre-Training via Multi-Stage Learning
ACM Multimedia (ACM MM), 2022
Zejun Li
Zhihao Fan
Huaixiao Tou
Jingjing Chen
Zhongyu Wei
Xuanjing Huang
245
23
0
29 Jan 2022
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
International Conference on Machine Learning (ICML), 2022
Junnan Li
Dongxu Li
Caiming Xiong
Guosheng Lin
MLLM
BDL
VLM
CLIP
1.4K
5,888
0
28 Jan 2022