Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models

19 May 2015

Bryan A. Plummer

Liwei Wang

Christopher M. Cervantes

Papers citing "Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models"

50 / 1,325 papers shown

A Survey on Visual Transfer Learning using Knowledge Graphs

Sebastian Monka

Lavdim Halilaj

Achim Rettinger

254

27 Jan 2022

PARS: Pseudo-Label Aware Robust Sample Selection for Learning with Noisy Labels

170

26 Jan 2022

Supervised Visual Attention for Simultaneous Multimodal Machine Translation

Journal of Artificial Intelligence Research (JAIR), 2022

223

23 Jan 2022

Unpaired Referring Expression Grounding via Bidirectional Cross-Modal Matching

Neurocomputing, 2022

Hengcan Shi

Munawar Hayat

Jianfei Cai

ObjD

209

18 Jan 2022

Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training

Yingwei Pan

Tao Mei

222

11 Jan 2022

Semantically Grounded Visual Embeddings for Zero-Shot Learning

285

03 Jan 2022

Deconfounded Visual Grounding

AAAI Conference on Artificial Intelligence (AAAI), 2021

Hanwang Zhang

203

31 Dec 2021

Grounding Linguistic Commands to Navigable Regions

IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2021

214

24 Dec 2021

Scaling Open-Vocabulary Image Segmentation with Image-Level Labels

European Conference on Computer Vision (ECCV), 2021

444

497

22 Dec 2021

A Survey of Natural Language Generation

ACM Computing Surveys (CSUR), 2021

Min Yang

336

22 Dec 2021

ScanQA: 3D Question Answering for Spatial Scene Understanding

Computer Vision and Pattern Recognition (CVPR), 2021

444

328

20 Dec 2021

Bottom Up Top Down Detection Transformers for Language Grounding in Images and Point Clouds

492

148

16 Dec 2021

Distilled Dual-Encoder Model for Vision-Language Understanding

214

16 Dec 2021

VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena

303

137

14 Dec 2021

Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation

Zuxuan Wu

221

10 Dec 2021

FLAVA: A Foundational Language And Vision Alignment Model

Amanpreet Singh

Douwe Kiela

383

873

08 Dec 2021

MLP Architectures for Vision-and-Language Modeling: An Empirical Study

Zicheng Liu

167

08 Dec 2021

Grounded Language-Image Pre-training

Jianwei Yang

...

Lu Yuan

Lei Zhang

468

1,407

07 Dec 2021

From Coarse to Fine-grained Concept based Discrimination for Phrase Detection

Maan Qraitem

Bryan A. Plummer

ObjD

200

06 Dec 2021

D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding

Dave Zhenyu Chen

Qirui Wu

Matthias Nießner

Angel X. Chang

196

02 Dec 2021

Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks

252

152

02 Dec 2021

Weakly-Supervised Video Object Grounding via Causal Intervention

325

01 Dec 2021

UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling

Zicheng Liu

357

134

23 Nov 2021

Florence: A New Foundation Model for Computer Vision

Lu Yuan

...

Jianwei Yang

409

1,060

22 Nov 2021

Class-agnostic Object Detection with Multi-modal Transformer

European Conference on Computer Vision (ECCV), 2021

Salman Khan

Rao Muhammad Anwer

625

117

22 Nov 2021

Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts

International Conference on Machine Learning (ICML), 2021

345

356

16 Nov 2021

Memotion Analysis through the Lens of Joint Embedding

AAAI Conference on Artificial Intelligence (AAAI), 2021

Nethra Gunti

Sathyanarayanan Ramamoorthy

Parth Patwa

Amitava Das

130

13 Nov 2021

FILIP: Fine-grained Interactive Language-Image Pre-Training

International Conference on Learning Representations (ICLR), 2021

Hang Xu

Xiaodan Liang

Zhenguo Li

Xin Jiang

Chunjing Xu

VLM CLIP

343

769

09 Nov 2021

Negative Sample is Negative in Its Own Way: Tailoring Negative Sentences for Image-Text Retrieval

186

05 Nov 2021

An Empirical Study of Training End-to-End Vision-and-Language Transformers

Computer Vision and Pattern Recognition (CVPR), 2021

...

Lu Yuan

Zicheng Liu

302

438

03 Nov 2021

VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts

Neural Information Processing Systems (NeurIPS), 2021

981

693

03 Nov 2021

Bangla Image Caption Generation through CNN-Transformer based Encoder-Decoder Network

Sathishkumar Samiappan

ViT

152

24 Oct 2021

Text-Based Person Search with Limited Data

British Machine Vision Conference (BMVC), 2021

Xiaoping Han

Sen He

Li Zhang

Tao Xiang

210

125

20 Oct 2021

Look at What I'm Doing: Self-Supervised Spatial Grounding of Narrations in Instructional Videos

Neural Information Processing Systems (NeurIPS), 2021

213

20 Oct 2021

VLDeformer: Vision-Language Decomposed Transformer for Fast Cross-Modal Retrieval

Knowledge-Based Systems (KBS), 2021

220

20 Oct 2021

Towards Language-guided Visual Recognition via Dynamic Convolutions

Yongjian Wu

242

17 Oct 2021

Unsupervised Natural Language Inference Using PHL Triplet Generation

261

16 Oct 2021

Is An Image Worth Five Sentences? A New Look into Semantics for Image-Text Matching

459

06 Oct 2021

Learning Structural Representations for Recipe Generation and Food Retrieval

Hao Wang

Guosheng Lin

Chunyan Miao

157

04 Oct 2021

CIDEr-R: Robust Consensus-based Image Description Evaluation

G. O. D. Santos

Esther Luna Colombini

Sandra Avila

161

28 Sep 2021

CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models

Yuan Yao

Ao Zhang

Zhengyan Zhang

Zhiyuan Liu

Tat-Seng Chua

Maosong Sun

MLLM VPVLM VLM

594

245

24 Sep 2021

Discovering and Validating AI Errors With Crowdsourced Failure Reports

Ángel Alexander Cabrera

181

23 Sep 2021

KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation

283

22 Sep 2021

Associative Memories via Predictive Coding

Lei Sha

196

16 Sep 2021

Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning

300

14 Sep 2021

xGQA: Cross-Lingual Visual Question Answering

362

13 Sep 2021

DSSL: Deep Surroundings-person Separation Learning for Text-based Person Retrieval

267

253

12 Sep 2021

Constructing Phrase-level Semantic Labels to Form Multi-Grained Supervision for Image-Text Retrieval

Xuanjing Huang

118

12 Sep 2021

Panoptic Narrative Grounding

IEEE International Conference on Computer Vision (ICCV), 2021

258

10 Sep 2021

Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021

Stella Frank

Emanuele Bugliarello

Desmond Elliott

190

09 Sep 2021