
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models

19 May 2015

Bryan A. Plummer

Liwei Wang

Christopher M. Cervantes

Papers citing "Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models"

50 / 1,325 papers shown

Scene Graph Based Fusion Network For Image-Text Retrieval

IEEE International Conference on Multimedia and Expo (ICME), 2023

Guoliang Wang

Yanlei Shang

Yongzhe Chen

165

20 Mar 2023

Efficient Image-Text Retrieval via Keyword-Guided Pre-Screening

Min Zhang

203

14 Mar 2023

Scaling Vision-Language Models with Sparse Mixture of Experts

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

Yuxiong He

335

100

13 Mar 2023

Learning Combinatorial Prompts for Universal Controllable Image Captioning

International Journal of Computer Vision (IJCV), 2023

Zhen Wang

Jun Xiao

Yueting Zhuang

Fei Gao

Jian Shao

Long Chen

200

11 Mar 2023

Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation

Zhiwei Zhang

Yuliang Liu

MLLM

375

10 Mar 2023

Tag2Text: Guiding Vision-Language Model via Image Tagging

International Conference on Learning Representations (ICLR), 2023

Xinyu Huang

Youcai Zhang

Jinyu Ma

Weiwei Tian

Rui Feng

Lei Zhang

418

10 Mar 2023

Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

European Conference on Computer Vision (ECCV), 2023

...

Jianwei Yang

Hang Su

Jun Zhu

Lei Zhang

ObjD

808

3,361

09 Mar 2023

Refined Vision-Language Modeling for Fine-grained Multi-modal Pre-training

154

09 Mar 2023

Knowledge-Based Counterfactual Queries for Visual Question Answering

Theodoti Stoikou

Maria Lymperaiou

Giorgos Stamou

AAML

173

05 Mar 2023

Connecting Vision and Language with Video Localized Narratives

Computer Vision and Pattern Recognition (CVPR), 2023

308

22 Feb 2023

Test-Time Distribution Normalization for Contrastively Learned Vision-language Models

Neural Information Processing Systems (NeurIPS), 2023

Ser-Nam Lim

250

22 Feb 2023

Few-shot Multimodal Multitask Multilingual Learning

Vasu Sharma

Vinija Jain

223

19 Feb 2023

Multimodal Federated Learning via Contrastive Representation Ensemble

International Conference on Learning Representations (ICLR), 2023

Yang Liu

177

126

17 Feb 2023

MINOTAUR: Multi-task Video Grounding From Multimodal Queries

232

16 Feb 2023

PolyFormer: Referring Image Segmentation as Sequential Polygon Generation

Computer Vision and Pattern Recognition (CVPR), 2023

344

182

14 Feb 2023

Multi-modal Machine Learning in Engineering Design: A Review and Future Directions

Journal of Computing and Information Science in Engineering (JCISE), 2023

359

14 Feb 2023

Symbolic Discovery of Optimization Algorithms

Neural Information Processing Systems (NeurIPS), 2023

...

821

523

13 Feb 2023

UniAdapter: Unified Parameter-Efficient Transfer Learning for Cross-modal Modeling

International Conference on Learning Representations (ICLR), 2023

Wei Zhan

Mingyu Ding

185

13 Feb 2023

Towards Local Visual Modeling for Image Captioning

Pattern Recognition (Pattern Recogn.), 2023

Jiayi Ji

242

107

13 Feb 2023

Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis

251

11 Feb 2023

Re-ViLM: Retrieval-Augmented Visual Language Model for Zero and Few-Shot Image Captioning

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

...

208

09 Feb 2023

LexLIP: Lexicon-Bottlenecked Language-Image Pre-Training for Large-Scale Image-Text Retrieval

178

06 Feb 2023

mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video

International Conference on Machine Learning (ICML), 2023

Jiabo Ye

...

Ji Zhang

Jingren Zhou

273

221

01 Feb 2023

Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications

Muhammad Arslan Manzoor

342

01 Feb 2023

STAIR: Learning Sparse Text and Image Representation in Grounded Tokens

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

Chen Chen

Albin Madappally Jose

268

30 Jan 2023

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

International Conference on Machine Learning (ICML), 2023

Silvio Savarese

1.3K

6,781

30 Jan 2023

Improving Cross-modal Alignment for Text-Guided Image Inpainting

Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2023

Yucheng Zhou

Guodong Long

272

26 Jan 2023

OvarNet: Towards Open-vocabulary Object Attribute Recognition

Computer Vision and Pattern Recognition (CVPR), 2023

Keyan Chen

Xiaolong Jiang

Yao Hu

Xu Tang

181

23 Jan 2023

MTTN: Multi-Pair Text to Text Narratives for Prompt Generation

229

21 Jan 2023

Masked Autoencoding Does Not Help Natural Language Supervision at Scale

Computer Vision and Pattern Recognition (CVPR), 2023

Floris Weers

Vaishaal Shankar

Angelos Katharopoulos

Yinfei Yang

Tom Gunter

CLIP

354

19 Jan 2023

Effective End-to-End Vision Language Pretraining with Semantic Visual Loss

IEEE Transactions on Multimedia (IEEE TMM), 2023

Xiaofeng Yang

Fayao Liu

Guosheng Lin

VLM

18 Jan 2023

Learning Customized Visual Models with Retrieval-Augmented Knowledge

Computer Vision and Pattern Recognition (CVPR), 2023

Jianwei Yang

234

17 Jan 2023

GLIGEN: Open-Set Grounded Text-to-Image Generation

Computer Vision and Pattern Recognition (CVPR), 2023

Jianwei Yang

436

807

17 Jan 2023

RILS: Masked Visual Reconstruction in Language Semantic Space

Computer Vision and Pattern Recognition (CVPR), 2023

Shusheng Yang

Ying Shan

194

17 Jan 2023

Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training

Computer Vision and Pattern Recognition (CVPR), 2023

351

102

05 Jan 2023

Noise-aware Learning from Web-crawled Image-Text Data for Image Captioning

IEEE International Conference on Computer Vision (ICCV), 2022

254

27 Dec 2022

Generalized Decoding for Pixel, Image, and Language

Computer Vision and Pattern Recognition (CVPR), 2022

Jianwei Yang

...

Lu Yuan

299

331

21 Dec 2022

HGAN: Hierarchical Graph Alignment Network for Image-Text Retrieval

IEEE Transactions on Multimedia (IEEE TMM), 2022

201

16 Dec 2022

MM-SHAP: A Performance-agnostic Metric for Measuring Multimodal Contributions in Vision and Language Models & Tasks

Annual Meeting of the Association for Computational Linguistics (ACL), 2022

Letitia Parcalabescu

Anette Frank

236

15 Dec 2022

FlexiViT: One Model for All Patch Sizes

Computer Vision and Pattern Recognition (CVPR), 2022

Ibrahim Alabdulmohsin

Filip Pavetić

VLM

429

142

15 Dec 2022

Retrieval-based Disentangled Representation Learning with Natural Language Supervision

International Conference on Learning Representations (ICLR), 2022

Jiawei Zhou

Xiaoguang Li

Lifeng Shang

Xin Jiang

Qun Liu

Lei Chen

DRL

282

15 Dec 2022

NLIP: Noise-robust Language-Image Pre-training

AAAI Conference on Artificial Intelligence (AAAI), 2022

Runhu Huang

Yanxin Long

Jianhua Han

Hang Xu

Xiwen Liang

Chunjing Xu

Xiaodan Liang

VLM

251

14 Dec 2022

Find Someone Who: Visual Commonsense Understanding in Human-Centric Grounding

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022

131

14 Dec 2022

ScanEnts3D: Exploiting Phrase-to-3D-Object Correspondences for Improved Visio-Linguistic Models in 3D Scenes

IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2022

Hsin-Ying Lee

267

12 Dec 2022

Vision and Structured-Language Pretraining for Cross-Modal Food Retrieval

Computer Vision and Image Understanding (CVIU), 2022

276

08 Dec 2022

Group Generalized Mean Pooling for Vision Transformer

307

08 Dec 2022

Weakly Supervised Annotations for Multi-modal Greeting Cards Dataset

Sidra Hanif

Longin Jan Latecki

194

01 Dec 2022

Improving Cross-Modal Retrieval with Set of Diverse Embeddings

Computer Vision and Pattern Recognition (CVPR), 2022

Dongwon Kim

Nam-Won Kim

Suha Kwak

531

30 Nov 2022

DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding

AAAI Conference on Artificial Intelligence (AAAI), 2022

Hang Su

Jun Zhu

Lei Zhang

ObjD

305

28 Nov 2022

SLAN: Self-Locator Aided Network for Cross-Modal Understanding

Ming-Ming Cheng

160

28 Nov 2022