arXiv: 1505.04870

Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models

19 May 2015

Bryan A. Plummer

Christopher M. Cervantes

Juan C. Caicedo

Anjali Narayan-Chen

Svetlana Lazebnik

Papers citing "Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models"

50 / 1,325 papers shown

Vision Language Transformers: A Survey

182

7

0

06 Jul 2023

ProbVLM: Probabilistic Adapter for Frozen Vision-Language Models

Uddeshya Upadhyay

Shyamgopal Karthik

471

6

0

01 Jul 2023

Stop Pre-Training: Adapt Visual-Language Models to Unseen Languages

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

119

4

0

29 Jun 2023

Benchmarking Zero-Shot Recognition with Vision-Language Models: Challenges on Granularity and Specificity

355

14

0

28 Jun 2023

CLIPA-v2: Scaling CLIP Training with 81.1% Zero-shot ImageNet Accuracy within a $10,000 Budget; An Extra $4,000 Unlocks 81.8% Accuracy

Cihang Xie

283

25

0

27 Jun 2023

Approximated Prompt Tuning for Vision-Language Pre-trained Models

Qiong Wu

127

2

0

27 Jun 2023

Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic

464

817

0

27 Jun 2023

Kosmos-2: Grounding Multimodal Large Language Models to the World

International Conference on Learning Representations (ICLR), 2023

404

1,039

0

26 Jun 2023

Localized Text-to-Image Generation for Free via Cross Attention Control

Ruslan Salakhutdinov

J. Zico Kolter

167

28

0

26 Jun 2023

Improving Reference-based Distinctive Image Captioning with Contrastive Rewards

210

10

0

25 Jun 2023

Switch-BERT: Learning to Model Multimodal Interactions by Switching Attention and Input

European Conference on Computer Vision (ECCV), 2023

103

6

0

25 Jun 2023

DesCo: Learning Object Recognition with Rich Language Descriptions

Neural Information Processing Systems (NeurIPS), 2023

Liunian Harold Li

189

29

0

24 Jun 2023

A Survey on Multimodal Large Language Models

National Science Review (NSR), 2023

Enhong Chen

463

1,022

0

23 Jun 2023

COSA: Concatenated Sample Pretrained Vision-Language Foundation Model

International Conference on Learning Representations (ICLR), 2023

214

11

0

15 Jun 2023

World-to-Words: Grounded Open Vocabulary Acquisition through Fast Mapping in Vision-Language Models

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

215

12

0

14 Jun 2023

Babel-ImageNet: Massively Multilingual Evaluation of Vision-and-Language Representations

Radu Timofte

163

6

0

14 Jun 2023

GeneCIS: A Benchmark for General Conditional Image Similarity

Computer Vision and Pattern Recognition (CVPR), 2023

249

43

0

13 Jun 2023

I See Dead People: Gray-Box Adversarial Attack on Image-To-Text Models

233

24

0

13 Jun 2023

Top-Down Framework for Weakly-supervised Grounded Image Captioning

Yi Wang

235

5

0

13 Jun 2023

Retrieval-Enhanced Contrastive Vision-Text Models

International Conference on Learning Representations (ICLR), 2023

Cordelia Schmid

296

39

0

12 Jun 2023

Global and Local Semantic Completion Learning for Vision-Language Pre-training

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023

Wenzhe Zhao

Hongfa Wang

Yujiu Yang

Wei Liu

253

8

0

12 Jun 2023

Sticker820K: Empowering Interactive Retrieval with Stickers

Sijie Zhao

Ying Shan

114

14

0

12 Jun 2023

Read, look and detect: Bounding box annotation from image-caption pairs

165

2

0

09 Jun 2023

Multimodal Explainable Artificial Intelligence: A Comprehensive Review of Methodological Advances and Future Research Directions

IEEE Access, 2023

Christos Sardianos

Panagiotis I. Radoglou-Grammatikis

Panagiotis G. Sarigiannidis

Iraklis Varlamis

Georgios Th. Papadopoulos

337

42

0

09 Jun 2023

Dealing with Semantic Underspecification in Multimodal NLP

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

Sandro Pezzelle

169

11

0

08 Jun 2023

ScaleDet: A Scalable Multi-Dataset Object Detector

Computer Vision and Pattern Recognition (CVPR), 2023

177

27

0

08 Jun 2023

Zambezi Voice: A Multilingual Speech Corpus for Zambian Languages

Interspeech, 2023

Claytone Sikasote

Kalinda Siaminwe

Mayumbo Nyirenda

Antonios Anastasopoulos

264

10

0

07 Jun 2023

Referring Expression Comprehension Using Language Adaptive Inference

AAAI Conference on Artificial Intelligence (AAAI), 2023

Xi Li

254

31

0

06 Jun 2023

GRES: Generalized Referring Expression Segmentation

Computer Vision and Pattern Recognition (CVPR), 2023

337

247

0

01 Jun 2023

Adapting Pre-trained Language Models to Vision-Language Tasks via Dynamic Visual Prompting

IEEE International Joint Conference on Neural Network (IJCNN), 2023

Qiong Wu

Rongsheng Zhang

128

2

0

01 Jun 2023

Too Large; Data Reduction for Vision-Language Pre-Training

IEEE International Conference on Computer Vision (ICCV), 2023

Alex Jinpeng Wang

Kevin Qinghong Lin

David Junhao Zhang

Stan Weixian Lei

Mike Zheng Shou

335

31

0

31 May 2023

Dense and Aligned Captions (DAC) Promote Compositional Reasoning in VL Models

Neural Information Processing Systems (NeurIPS), 2023

...

Leonid Karlinsky

387

73

0

31 May 2023

DisCLIP: Open-Vocabulary Referring Expression Generation

British Machine Vision Conference (BMVC), 2023

261

9

0

30 May 2023

Learning without Forgetting for Vision-Language Models

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023

Da-Wei Zhou

Jingyi Ning

De-Chuan Zhan

Ziwei Liu

387

78

0

30 May 2023

Controllable Text-to-Image Generation with GPT-4

Tianjun Zhang

347

61

0

29 May 2023

Contextual Object Detection with Multimodal Large Language Models

International Journal of Computer Vision (IJCV), 2023

Chen Change Loy

328

142

0

29 May 2023

TaleCrafter: Interactive Story Visualization with Multiple Characters

ACM SIGGRAPH Conference and Exhibition on Computer Graphics and Interactive Techniques in Asia (SIGGRAPH Asia), 2023

Xiaodong Cun

...

Yong Zhang

Ying Shan

Yujiu Yang

351

65

0

29 May 2023

Improved Probabilistic Image-Text Representations

International Conference on Learning Representations (ICLR), 2023

604

43

0

29 May 2023

VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset

Neural Information Processing Systems (NeurIPS), 2023

515

174

0

29 May 2023

Test-Time Adaptation with CLIP Reward for Zero-Shot Generalization in Vision-Language Models

International Conference on Learning Representations (ICLR), 2023

322

39

0

29 May 2023

ConaCLIP: Exploring Distillation of Fully-Connected Knowledge Interaction Graph for Lightweight Text-Image Retrieval

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

Chengyu Wang

Lianwen Jin

252

9

0

28 May 2023

Z-GMOT: Zero-shot Generic Multiple Object Tracking

Kim Hoang Tran

Anh Duy Le Dinh

Tien-Phat Nguyen

Gianfranco Doretto

Ngan Hoang Le

295

10

0

28 May 2023

PuMer: Pruning and Merging Tokens for Efficient Vision Language Models

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

Bhargavi Paranjape

Hannaneh Hajishirzi

173

51

0

27 May 2023

BIG-C: a Multimodal Multi-Purpose Dataset for Bemba

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

Claytone Sikasote

Md Mahfuz Ibn Alam

Antonios Anastasopoulos

176

8

0

26 May 2023

Three Towers: Flexible Contrastive Learning with Pretrained Image Models

Neural Information Processing Systems (NeurIPS), 2023

Andreas Steiner

Rodolphe Jenatton

Efi Kokiopoulou

216

18

0

26 May 2023

Learning to Imagine: Visually-Augmented Natural Language Generation

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

427

10

0

26 May 2023

ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst

Jing Liu

330

69

0

25 May 2023

Weakly Supervised Vision-and-Language Pre-training with Relative Representations

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

Peng Li

Maosong Sun

Yang Liu

152

2

0

24 May 2023

Visual Programming for Text-to-Image Generation and Evaluation

Joey Tianyi Zhou

390

55

0

24 May 2023

Pento-DIARef: A Diagnostic Dataset for Learning the Incremental Algorithm for Referring Expression Generation from Examples

Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2023

David Schlangen

132

3

0

24 May 2023
