1505.04870
v4 (latest)
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models
19 May 2015
Bryan A. Plummer
Liwei Wang
Christopher M. Cervantes
Juan C. Caicedo
Anjali Narayan-Chen
Svetlana Lazebnik
Papers citing
"Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models"
50 / 1,325 papers shown
Scene Graph Based Fusion Network For Image-Text Retrieval
IEEE International Conference on Multimedia and Expo (ICME), 2023
Guoliang Wang
Yanlei Shang
Yongzhe Chen
165
3
0
20 Mar 2023
Efficient Image-Text Retrieval via Keyword-Guided Pre-Screening
Min Cao
Yang Bai
Wenwen Qiang
Ziqiang Cao
Liqiang Nie
Min Zhang
203
4
0
14 Mar 2023
Scaling Vision-Language Models with Sparse Mixture of Experts
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Sheng Shen
Z. Yao
Chunyuan Li
Trevor Darrell
Kurt Keutzer
Yuxiong He
VLM
MoE
335
100
0
13 Mar 2023
Learning Combinatorial Prompts for Universal Controllable Image Captioning
International Journal of Computer Vision (IJCV), 2023
Zhen Wang
Jun Xiao
Yueting Zhuang
Fei Gao
Jian Shao
Long Chen
200
12
0
11 Mar 2023
Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation
Zhiwei Zhang
Yuliang Liu
MLLM
375
0
0
10 Mar 2023
Tag2Text: Guiding Vision-Language Model via Image Tagging
International Conference on Learning Representations (ICLR), 2023
Xinyu Huang
Youcai Zhang
Jinyu Ma
Weiwei Tian
Rui Feng
Yuejie Zhang
Yaqian Li
Yandong Guo
Lei Zhang
CLIP
MLLM
VLM
3DV
418
98
0
10 Mar 2023
Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
European Conference on Computer Vision (ECCV), 2023
Shilong Liu
Zhaoyang Zeng
Tianhe Ren
Feng Li
Hao Zhang
...
Chun-yue Li
Jianwei Yang
Hang Su
Jun Zhu
Lei Zhang
ObjD
808
3,361
0
09 Mar 2023
Refined Vision-Language Modeling for Fine-grained Multi-modal Pre-training
Lisai Zhang
Qingcai Chen
Zhijian Chen
Yunpeng Han
Zhonghua Li
Bo Zhao
VLM
154
1
0
09 Mar 2023
Knowledge-Based Counterfactual Queries for Visual Question Answering
Theodoti Stoikou
Maria Lymperaiou
Giorgos Stamou
AAML
173
1
0
05 Mar 2023
Connecting Vision and Language with Video Localized Narratives
Computer Vision and Pattern Recognition (CVPR), 2023
P. Voigtlaender
Soravit Changpinyo
Jordi Pont-Tuset
Radu Soricut
V. Ferrari
VGen
308
30
0
22 Feb 2023
Test-Time Distribution Normalization for Contrastively Learned Vision-language Models
Neural Information Processing Systems (NeurIPS), 2023
Yi Zhou
Juntao Ren
Fengyu Li
Ramin Zabih
Ser-Nam Lim
VLM
250
21
0
22 Feb 2023
Few-shot Multimodal Multitask Multilingual Learning
Vasu Sharma
Vinija Jain
223
0
0
19 Feb 2023
Multimodal Federated Learning via Contrastive Representation Ensemble
International Conference on Learning Representations (ICLR), 2023
Qiying Yu
Yang Liu
Yimu Wang
Ke Xu
Jingjing Liu
177
126
0
17 Feb 2023
MINOTAUR: Multi-task Video Grounding From Multimodal Queries
Raghav Goyal
E. Mavroudi
Xitong Yang
Sainbayar Sukhbaatar
Leonid Sigal
Matt Feiszli
Lorenzo Torresani
Du Tran
232
8
0
16 Feb 2023
PolyFormer: Referring Image Segmentation as Sequential Polygon Generation
Computer Vision and Pattern Recognition (CVPR), 2023
Jiang Liu
Hui Ding
Zhaowei Cai
Yuting Zhang
R. Satzoda
Vijay Mahadevan
R. Manmatha
ObjD
344
182
0
14 Feb 2023
Multi-modal Machine Learning in Engineering Design: A Review and Future Directions
Journal of Computing and Information Science in Engineering (JCISE), 2023
Binyang Song
Ruilin Zhou
Faez Ahmed
AI4CE
359
65
0
14 Feb 2023
Symbolic Discovery of Optimization Algorithms
Neural Information Processing Systems (NeurIPS), 2023
Xiangning Chen
Chen Liang
Da Huang
Esteban Real
Kaiyuan Wang
...
Xuanyi Dong
Thang Luong
Cho-Jui Hsieh
Yifeng Lu
Quoc V. Le
821
523
0
13 Feb 2023
UniAdapter: Unified Parameter-Efficient Transfer Learning for Cross-modal Modeling
International Conference on Learning Representations (ICLR), 2023
Haoyu Lu
Yuqi Huo
Guoxing Yang
Zhiwu Lu
Wei Zhan
Masayoshi Tomizuka
Mingyu Ding
185
54
0
13 Feb 2023
Towards Local Visual Modeling for Image Captioning
Pattern Recognition (Pattern Recogn.), 2023
Yiwei Ma
Jiayi Ji
Xiaoshuai Sun
Weihao Ye
Rongrong Ji
ViT
242
107
0
13 Feb 2023
Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis
Zhu Wang
Sourav Medya
Sathya Ravi
VLM
251
1
0
11 Feb 2023
Re-ViLM: Retrieval-Augmented Visual Language Model for Zero and Few-Shot Image Captioning
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Zhuolin Yang
Ming-Yu Liu
Zihan Liu
V. Korthikanti
Weili Nie
...
Yuke Zhu
Mohammad Shoeybi
Bryan Catanzaro
Chaowei Xiao
Anima Anandkumar
VLM
RALM
208
53
0
09 Feb 2023
LexLIP: Lexicon-Bottlenecked Language-Image Pre-Training for Large-Scale Image-Text Retrieval
Ziyang Luo
Pu Zhao
Can Xu
Xiubo Geng
Tao Shen
Chongyang Tao
Jing Ma
Qingwen Lin
Daxin Jiang
VLM
CLIP
178
3
0
06 Feb 2023
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video
International Conference on Machine Learning (ICML), 2023
Haiyang Xu
Qinghao Ye
Mingshi Yan
Yaya Shi
Jiabo Ye
...
Guohai Xu
Ji Zhang
Songfang Huang
Feiran Huang
Jingren Zhou
MLLM
VLM
MoE
273
221
0
01 Feb 2023
Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications
Muhammad Arslan Manzoor
S. Albarri
Ziting Xian
Zaiqiao Meng
Preslav Nakov
Shangsong Liang
AI4TS
342
53
0
01 Feb 2023
STAIR: Learning Sparse Text and Image Representation in Grounded Tokens
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Chen Chen
Bowen Zhang
Liangliang Cao
Jiguang Shen
Tom Gunter
Albin Madappally Jose
Alexander Toshev
Jonathon Shlens
Ruoming Pang
Yinfei Yang
VLM
3DV
268
27
0
30 Jan 2023
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
International Conference on Machine Learning (ICML), 2023
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
VLM
MLLM
1.3K
6,781
0
30 Jan 2023
Improving Cross-modal Alignment for Text-Guided Image Inpainting
Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2023
Yucheng Zhou
Guodong Long
272
31
0
26 Jan 2023
OvarNet: Towards Open-vocabulary Object Attribute Recognition
Computer Vision and Pattern Recognition (CVPR), 2023
Keyan Chen
Xiaolong Jiang
Yao Hu
Xu Tang
Yan Gao
Jianqi Chen
Weidi Xie
VLM
ObjD
181
55
0
23 Jan 2023
MTTN: Multi-Pair Text to Text Narratives for Prompt Generation
Archan Ghosh
Debgandhar Ghosh
Madhurima Maji
Suchinta Chanda
Kalporup Goswami
229
1
0
21 Jan 2023
Masked Autoencoding Does Not Help Natural Language Supervision at Scale
Computer Vision and Pattern Recognition (CVPR), 2023
Floris Weers
Vaishaal Shankar
Angelos Katharopoulos
Yinfei Yang
Tom Gunter
CLIP
354
6
0
19 Jan 2023
Effective End-to-End Vision Language Pretraining with Semantic Visual Loss
IEEE Transactions on Multimedia (IEEE TMM), 2023
Xiaofeng Yang
Fayao Liu
Guosheng Lin
VLM
99
15
0
18 Jan 2023
Learning Customized Visual Models with Retrieval-Augmented Knowledge
Computer Vision and Pattern Recognition (CVPR), 2023
Haotian Liu
Kilho Son
Jianwei Yang
Ce Liu
Jianfeng Gao
Yong Jae Lee
Chunyuan Li
VLM
234
77
0
17 Jan 2023
GLIGEN: Open-Set Grounded Text-to-Image Generation
Computer Vision and Pattern Recognition (CVPR), 2023
Yuheng Li
Haotian Liu
Qingyang Wu
Fangzhou Mu
Jianwei Yang
Jianfeng Gao
Chunyuan Li
Yong Jae Lee
VLM
436
807
1
17 Jan 2023
RILS: Masked Visual Reconstruction in Language Semantic Space
Computer Vision and Pattern Recognition (CVPR), 2023
Shusheng Yang
Yixiao Ge
Kun Yi
Dian Li
Ying Shan
Xiaohu Qie
Xinggang Wang
CLIP
194
14
0
17 Jan 2023
Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training
Computer Vision and Pattern Recognition (CVPR), 2023
Filip Radenovic
Abhimanyu Dubey
Abhishek Kadian
Todor Mihaylov
Simon Vandenhende
Yash J. Patel
Y. Wen
Vignesh Ramanathan
D. Mahajan
VLM
351
102
0
05 Jan 2023
Noise-aware Learning from Web-crawled Image-Text Data for Image Captioning
IEEE International Conference on Computer Vision (ICCV), 2022
Woohyun Kang
Jonghwan Mun
Sungjun Lee
Byungseok Roh
VLM
254
28
0
27 Dec 2022
Generalized Decoding for Pixel, Image, and Language
Computer Vision and Pattern Recognition (CVPR), 2022
Xueyan Zou
Zi-Yi Dou
Jianwei Yang
Zhe Gan
Linjie Li
...
Lu Yuan
Nanyun Peng
Lijuan Wang
Yong Jae Lee
Jianfeng Gao
VLM
MLLM
ObjD
299
331
0
21 Dec 2022
HGAN: Hierarchical Graph Alignment Network for Image-Text Retrieval
IEEE Transactions on Multimedia (IEEE TMM), 2022
Jie Guo
Meiting Wang
Yan Zhou
Bin Song
Yuhao Chi
Wei-liang Fan
Jianglong Chang
201
30
0
16 Dec 2022
MM-SHAP: A Performance-agnostic Metric for Measuring Multimodal Contributions in Vision and Language Models & Tasks
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
Letitia Parcalabescu
Anette Frank
236
49
0
15 Dec 2022
FlexiViT: One Model for All Patch Sizes
Computer Vision and Pattern Recognition (CVPR), 2022
Lucas Beyer
Pavel Izmailov
Alexander Kolesnikov
Mathilde Caron
Simon Kornblith
Xiaohua Zhai
Matthias Minderer
Michael Tschannen
Ibrahim Alabdulmohsin
Filip Pavetić
VLM
429
142
0
15 Dec 2022
Retrieval-based Disentangled Representation Learning with Natural Language Supervision
International Conference on Learning Representations (ICLR), 2022
Jiawei Zhou
Xiaoguang Li
Lifeng Shang
Xin Jiang
Qun Liu
Lei Chen
DRL
282
10
0
15 Dec 2022
NLIP: Noise-robust Language-Image Pre-training
AAAI Conference on Artificial Intelligence (AAAI), 2022
Runhu Huang
Yanxin Long
Jianhua Han
Hang Xu
Xiwen Liang
Chunjing Xu
Xiaodan Liang
VLM
251
40
0
14 Dec 2022
Find Someone Who: Visual Commonsense Understanding in Human-Centric Grounding
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Haoxuan You
Rui Sun
Zhecan Wang
Kai-Wei Chang
Shih-Fu Chang
131
7
0
14 Dec 2022
ScanEnts3D: Exploiting Phrase-to-3D-Object Correspondences for Improved Visio-Linguistic Models in 3D Scenes
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2022
Ahmed Abdelreheem
Kyle Olszewski
Hsin-Ying Lee
Peter Wonka
Panos Achlioptas
3DPC
267
33
0
12 Dec 2022
Vision and Structured-Language Pretraining for Cross-Modal Food Retrieval
Computer Vision and Image Understanding (CVIU), 2022
Mustafa Shukor
Nicolas Thome
Matthieu Cord
CLIP
CoGe
276
15
0
08 Dec 2022
Group Generalized Mean Pooling for Vision Transformer
ByungSoo Ko
Han-Gyu Kim
Byeongho Heo
Sangdoo Yun
Sanghyuk Chun
Geonmo Gu
Wonjae Kim
ViT
307
3
0
08 Dec 2022
Weakly Supervised Annotations for Multi-modal Greeting Cards Dataset
Sidra Hanif
Longin Jan Latecki
194
0
0
01 Dec 2022
Improving Cross-Modal Retrieval with Set of Diverse Embeddings
Computer Vision and Pattern Recognition (CVPR), 2022
Dongwon Kim
Nam-Won Kim
Suha Kwak
531
66
0
30 Nov 2022
DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding
AAAI Conference on Artificial Intelligence (AAAI), 2022
Siyi Liu
Yaoyuan Liang
Feng Li
Shijia Huang
Hao Zhang
Hang Su
Jun Zhu
Lei Zhang
ObjD
305
40
0
28 Nov 2022
SLAN: Self-Locator Aided Network for Cross-Modal Understanding
Jiang-Tian Zhai
Tao Gui
Tong Wu
Xinghan Chen
Jiangjiang Liu
Bo Ren
Ming-Ming Cheng
ObjD
VLM
160
1
0
28 Nov 2022