1505.04870
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models
19 May 2015
Bryan A. Plummer
Liwei Wang
Christopher M. Cervantes
Juan C. Caicedo
Anjali Narayan-Chen
Svetlana Lazebnik
Papers citing
"Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models"
50 / 1,325 papers shown
Learning Object-Language Alignments for Open-Vocabulary Object Detection
International Conference on Learning Representations (ICLR), 2022
Chuang Lin
Pei Sun
Yi Jiang
Ping Luo
Zhuang Li
Gholamreza Haffari
Zehuan Yuan
Jianfei Cai
VLM
ObjD
200
118
0
27 Nov 2022
CLID: Controlled-Length Image Descriptions with Limited Data
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2022
Elad Hirsch
A. Tal
VLM
3DV
219
5
0
27 Nov 2022
MNER-QG: An End-to-End MRC framework for Multimodal Named Entity Recognition with Query Grounding
AAAI Conference on Artificial Intelligence (AAAI), 2022
Meihuizi Jia
Lei Shen
Xin Shen
L. Liao
Meng Chen
Xiaodong He
Zhen-Heng Chen
Jiaqi Li
204
55
0
27 Nov 2022
Who are you referring to? Coreference resolution in image narrations
IEEE International Conference on Computer Vision (ICCV), 2022
A. Goel
Basura Fernando
Frank Keller
Hakan Bilen
272
5
0
26 Nov 2022
Look Around and Refer: 2D Synthetic Semantics Knowledge Distillation for 3D Visual Grounding
Neural Information Processing Systems (NeurIPS), 2022
Eslam Mohamed Bakr
Yasmeen Alsaedy
Mohamed Elhoseiny
3DPC
192
61
0
25 Nov 2022
Overcoming Catastrophic Forgetting by XAI
Giang Nguyen
229
0
0
25 Nov 2022
TPA-Net: Generate A Dataset for Text to Physics-based Animation
Yuxing Qiu
Feng Gao
Minchen Li
Govind Thattai
Yin Yang
Jian Ren
PINN
DiffM
VGen
195
0
0
25 Nov 2022
ComCLIP: Training-Free Compositional Image and Text Matching
North American Chapter of the Association for Computational Linguistics (NAACL), 2022
Kenan Jiang
Xuehai He
Ruize Xu
Xinze Wang
VLM
CLIP
CoGe
408
25
0
25 Nov 2022
Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning
Computer Vision and Pattern Recognition (CVPR), 2022
Yatai Ji
Rong-Cheng Tu
Jie Jiang
Weijie Kong
Chengfei Cai
Wenzhe Zhao
Hongfa Wang
Yujiu Yang
Wei Liu
VLM
267
18
0
24 Nov 2022
X²-VLM: All-In-One Pre-trained Model For Vision-Language Tasks
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022
Yan Zeng
Xinsong Zhang
Hang Li
Jiawei Wang
Jipeng Zhang
Wangchunshu Zhou
VLM
MLLM
243
26
0
22 Nov 2022
Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2022
Zineng Tang
Jaemin Cho
Jie Lei
Joey Tianyi Zhou
VLM
179
10
0
21 Nov 2022
ClipCrop: Conditioned Cropping Driven by Vision-Language Model
Zhihang Zhong
Mingxi Cheng
Zhirong Wu
Yuhui Yuan
Yinqiang Zheng
Ji Li
Han Hu
Stephen Lin
Yoichi Sato
Imari Sato
VLM
CLIP
157
8
0
21 Nov 2022
Unifying Tracking and Image-Video Object Detection
Peirong Liu
Rui Wang
Pengchuan Zhang
Omid Poursaeed
Yipin Zhou
Xuefei Cao
Sreya Dutta Roy
Ashish Shah
Ser-Nam Lim
189
0
0
20 Nov 2022
Leveraging per Image-Token Consistency for Vision-Language Pre-training
Computer Vision and Pattern Recognition (CVPR), 2022
Yunhao Gou
Tom Ko
Hansi Yang
James T. Kwok
Yu Zhang
Mingxuan Wang
VLM
194
11
0
20 Nov 2022
Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks
Computer Vision and Pattern Recognition (CVPR), 2022
Hao Li
Jinguo Zhu
Xiaohu Jiang
Xizhou Zhu
Jiaming Song
...
Xiaohua Wang
Yu Qiao
Xiaogang Wang
Wenhai Wang
Jifeng Dai
MLLM
171
68
0
17 Nov 2022
Will Large-scale Generative Models Corrupt Future Datasets?
IEEE International Conference on Computer Vision (ICCV), 2022
Ryuichiro Hataya
Han Bao
Hiromi Arai
245
71
0
15 Nov 2022
A Unified Mutual Supervision Framework for Referring Expression Segmentation and Generation
Shijia Huang
Feng Li
Hao Zhang
Siyi Liu
Lei Zhang
Liwei Wang
185
5
0
15 Nov 2022
Zero-shot Image Captioning by Anchor-augmented Vision-Language Space Alignment
Junyan Wang
Yi Zhang
Ming Yan
Ji Zhang
Jitao Sang
VLM
135
11
0
14 Nov 2022
Late Fusion with Triplet Margin Objective for Multimodal Ideology Prediction and Analysis
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Changyuan Qiu
Winston Wu
Xinliang Frederick Zhang
Lu Wang
151
1
0
04 Nov 2022
Text-Only Training for Image Captioning using Noise-Injected CLIP
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
David Nukrai
Ron Mokady
Amir Globerson
VLM
CLIP
306
125
0
01 Nov 2022
Generative Negative Text Replay for Continual Vision-Language Pretraining
European Conference on Computer Vision (ECCV), 2022
Shipeng Yan
Lanqing Hong
Hang Xu
Jianhua Han
Tinne Tuytelaars
Zhenguo Li
Xuming He
VLM
CLL
CLIP
177
24
0
31 Oct 2022
Multilingual Multimodality: A Taxonomical Survey of Datasets, Techniques, Challenges and Opportunities
Khyathi Chandu
A. Geramifard
209
3
0
30 Oct 2022
A Survey on Causal Representation Learning and Future Work for Medical Image Analysis
Chang-Tien Lu
OOD
BDL
CML
MedIm
255
0
0
28 Oct 2022
Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2022
Tong Wang
Jorma T. Laaksonen
T. Langer
Heikki Arponen
Tom E. Bishop
VLM
157
6
0
24 Oct 2022
Towards Unifying Reference Expression Generation and Comprehension
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Duo Zheng
Tao Kong
Ya Jing
Jiaan Wang
Xiaojie Wang
ObjD
177
9
0
24 Oct 2022
Towards Real-Time Text2Video via CLIP-Guided, Pixel-Level Optimization
Peter Schaldenbrand
Zhixuan Liu
Jean Oh
CLIP
198
0
0
23 Oct 2022
Extending Phrase Grounding with Pronouns in Visual Dialogues
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Panzhong Lu
Xin Zhang
Meishan Zhang
Min Zhang
ObjD
194
5
0
23 Oct 2022
RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing Data
IEEE Transactions on Geoscience and Remote Sensing (IEEE TGRS), 2022
Yangfan Zhan
Zhitong Xiong
Yuan Yuan
242
188
0
23 Oct 2022
Learning Point-Language Hierarchical Alignment for 3D Visual Grounding
Jiaming Chen
Weihua Luo
Ran Song
Xiaolin K. Wei
Lin Ma
Wei Emma Zhang
3DV
321
8
0
22 Oct 2022
Prophet Attention: Predicting Attention with Future Attention for Image Captioning
Neural Information Processing Systems (NeurIPS), 2022
Fenglin Liu
Xuancheng Ren
Xian Wu
Wei Fan
Yuexian Zou
Xu Sun
234
52
0
19 Oct 2022
TOIST: Task Oriented Instance Segmentation Transformer with Noun-Pronoun Distillation
Neural Information Processing Systems (NeurIPS), 2022
Pengfei Li
Beiwen Tian
Yongliang Shi
Xiaoxue Chen
Hao Zhao
Guyue Zhou
Ya Zhang
262
29
0
19 Oct 2022
LVP-M3: Language-aware Visual Prompt for Multilingual Multimodal Machine Translation
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Hongcheng Guo
Jiaheng Liu
Haoyang Huang
Jian Yang
Zhoujun Li
Dongdong Zhang
Zheng Cui
Furu Wei
190
24
0
19 Oct 2022
CPL: Counterfactual Prompt Learning for Vision and Language Models
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Xuehai He
Diji Yang
Weixi Feng
Tsu-Jui Fu
Arjun Reddy Akula
Varun Jampani
P. Narayana
Sugato Basu
William Yang Wang
Xinze Wang
VPVLM
VLM
330
19
0
19 Oct 2022
Non-Contrastive Learning Meets Language-Image Pre-Training
Computer Vision and Pattern Recognition (CVPR), 2022
Jinghao Zhou
Li Dong
Zhe Gan
Lijuan Wang
Furu Wei
VLM
CLIP
218
33
0
17 Oct 2022
Contrastive Language-Image Pre-Training with Knowledge Graphs
Neural Information Processing Systems (NeurIPS), 2022
Xuran Pan
Tianzhu Ye
Dongchen Han
Qing Xiao
Gao Huang
VLM
CLIP
191
64
0
17 Oct 2022
One does not fit all! On the Complementarity of Vision Encoders for Vision and Language Tasks
Workshop on Representation Learning for NLP (RepL4NLP), 2022
Gregor Geigle
Chen Cecilia Liu
Jonas Pfeiffer
Iryna Gurevych
VLM
188
1
0
12 Oct 2022
MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model
Computer Vision and Pattern Recognition (CVPR), 2022
Yatai Ji
Junjie Wang
Yuan Gong
Lin Zhang
Yan Zhu
Hongfa Wang
Jiaxing Zhang
Tetsuya Sakai
Yujiu Yang
MLLM
261
58
0
11 Oct 2022
Fighting FIRe with FIRE: Assessing the Validity of Text-to-Video Retrieval Benchmarks
Findings (Findings), 2022
Pedro Rodriguez
Mahmoud Azab
Becka Silvert
Renato Sanchez
Linzy Labson
Hardik Shah
Seungwhan Moon
226
2
0
10 Oct 2022
YFACC: A Yorùbá speech-image dataset for cross-lingual keyword localisation through visual grounding
Spoken Language Technology Workshop (SLT), 2022
Kayode Olaleye
Dan Oneaţă
Herman Kamper
ObjD
226
8
0
10 Oct 2022
Distill the Image to Nowhere: Inversion Knowledge Distillation for Multimodal Machine Translation
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Ru Peng
Yawen Zeng
Jiaqi Zhao
240
24
0
10 Oct 2022
MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning
Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2022
Zijia Zhao
Longteng Guo
Xingjian He
Shuai Shao
Zehuan Yuan
Jing Liu
356
13
0
09 Oct 2022
VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment
Shraman Pramanick
Li Jing
Sayan Nag
Jiachen Zhu
Hardik Shah
Yann LeCun
Ramalingam Chellappa
325
26
0
09 Oct 2022
Affection: Learning Affective Explanations for Real-World Visual Data
Computer Vision and Pattern Recognition (CVPR), 2022
Panos Achlioptas
M. Ovsjanikov
Leonidas Guibas
Sergey Tulyakov
183
26
0
04 Oct 2022
Enhancing Interpretability and Interactivity in Robot Manipulation: A Neurosymbolic Approach
Georgios Tziafas
Hamidreza Kasaei
LM&Ro
361
5
0
03 Oct 2022
ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training
Bin Shan
Weichong Yin
Yu Sun
Hao Tian
Hua Wu
Haifeng Wang
VLM
162
21
0
30 Sep 2022
MUG: Interactive Multimodal Grounding on User Interfaces
Findings (Findings), 2022
Tao Li
Gang Li
Jingjie Zheng
Purple Wang
Yang Li
LLMAG
186
11
0
29 Sep 2022
Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022
Fengyuan Shi
Ruopeng Gao
Weilin Huang
Limin Wang
230
49
0
28 Sep 2022
UniCLIP: Unified Framework for Contrastive Language-Image Pre-training
Neural Information Processing Systems (NeurIPS), 2022
Janghyeon Lee
Jongsuk Kim
Hyounguk Shon
Bumsoo Kim
Seung Wook Kim
Honglak Lee
Junmo Kim
CLIP
VLM
333
65
0
27 Sep 2022
DRAMA: Joint Risk Localization and Captioning in Driving
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2022
Srikanth Malla
Chiho Choi
Isht Dwivedi
Joonhyang Choi
Jiachen Li
320
154
0
22 Sep 2022
Exploring Modulated Detection Transformer as a Tool for Action Recognition in Videos
Tomás Crisol
J. Ermantraut
Adrián Rostagno
Santiago L. Aggio
Javier Iparraguirre
151
0
0
21 Sep 2022