
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models

19 May 2015

Bryan A. Plummer

Liwei Wang

Christopher M. Cervantes

Papers citing "Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models"

50 / 1,325 papers shown

LAVIS: A Library for Language-Vision Intelligence

Silvio Savarese

337

15 Sep 2022

OmniVL: One Foundation Model for Image-Language and Video-Language Tasks

Neural Information Processing Systems (NeurIPS), 2022

Zuxuan Wu

Lu Yuan

294

178

15 Sep 2022

OmDet: Large-scale vision-language multi-dataset pre-training with multimodal detection network

IET Computer Vision (ICV), 2022

157

10 Sep 2022

Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions

ACM Computing Surveys (ACM CSUR), 2022

Paul Pu Liang

Amir Zadeh

Louis-Philippe Morency

315

169

07 Sep 2022

Statistical Foundation Behind Machine Learning and Its Impact on Computer Vision

Lei Zhang

H. Shum

VLM SSL

144

06 Sep 2022

Design of the topology for contrastive visual-textual alignment

Zhun Sun

376

05 Sep 2022

RLIP: Relational Language-Image Pre-training for Human-Object Interaction Detection

Neural Information Processing Systems (NeurIPS), 2022

374

05 Sep 2022

Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment

British Machine Vision Conference (BMVC), 2022

311

29 Aug 2022

MuMUR: Multilingual Multimodal Universal Retrieval

Avinash Madasu

Estelle Aflalo

Gabriela Ben-Melech Stan

Shachar Rosenman

Shao-Yen Tseng

Gedas Bertasius

Vasudev Lal

430

24 Aug 2022

Learning More May Not Be Better: Knowledge Transferability in Vision and Language Tasks

Journal of Imaging (JI), 2022

139

23 Aug 2022

Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks

...

640

707

22 Aug 2022

CODER: Coupled Diversity-Sensitive Momentum Contrastive Learning for Image-Text Retrieval

European Conference on Computer Vision (ECCV), 2022

Errui Ding

Jingdong Wang

199

21 Aug 2022

VLMAE: Vision-Language Masked Autoencoder

205

19 Aug 2022

Multimodal foundation models are better simulators of the human brain

Mingyu Ding

...

183

17 Aug 2022

GRIT-VLP: Grouped Mini-batch Sampling for Efficient Vision and Language Pre-training

European Conference on Computer Vision (ECCV), 2022

212

08 Aug 2022

Fine-Grained Semantically Aligned Vision-Language Pre-Training

Neural Information Processing Systems (NeurIPS), 2022

209

100

04 Aug 2022

Masked Vision and Language Modeling for Multi-modal Representation Learning

International Conference on Learning Representations (ICLR), 2022

257

03 Aug 2022

Augmenting Vision Language Pretraining by Learning Codebook with Visual Semantics

International Conference on Pattern Recognition (ICPR), 2022

186

31 Jul 2022

Curriculum Learning for Data-Efficient Vision-Language Alignment

Tejas Srinivasan

Xiang Ren

Jesse Thomason

VLM

156

29 Jul 2022

Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training

European Conference on Computer Vision (ECCV), 2022

Lu Yuan

225

26 Jul 2022

WinoGAViL: Gamified Association Benchmark to Challenge Vision-and-Language Models

Neural Information Processing Systems (NeurIPS), 2022

Gabriel Stanovsky

219

25 Jul 2022

Chunk-aware Alignment and Lexical Constraint for Visual Entailment with Natural Language Explanations

ACM Multimedia (ACM MM), 2022

Qian Yang

Yunxin Li

Baotian Hu

Lin Ma

Yuxin Ding

Min Zhang

240

23 Jul 2022

Rethinking the Reference-based Distinctive Image Captioning

ACM Multimedia (ACM MM), 2022

228

22 Jul 2022

Don't Stop Learning: Towards Continual Learning for the CLIP Model

Yuxuan Ding

Lingqiao Liu

226

19 Jul 2022

Entity-enhanced Adaptive Reconstruction Network for Weakly Supervised Referring Expression Grounding

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022

251

18 Jul 2022

FashionViL: Fashion-Focused Vision-and-Language Representation Learning

European Conference on Computer Vision (ECCV), 2022

Li Zhang

192

17 Jul 2022

LineCap: Line Charts for Data Visualization Captioning Models

Visual .. (VISUAL), 2022

196

15 Jul 2022

Toward Explainable and Fine-Grained 3D Grounding through Referring Textual Phrases

Zhuo Li

194

05 Jul 2022

Contrastive Cross-Modal Knowledge Sharing Pre-training for Vision-Language Representation Learning and Retrieval

198

02 Jul 2022

Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations

Computer Vision and Pattern Recognition (CVPR), 2022

Ziyan Yang

Kushal Kafle

Franck Dernoncourt

Vicente Ordónez Román

VLM

422

30 Jun 2022

Towards Adversarial Attack on Vision-Language Pre-training Models

ACM Multimedia (ACM MM), 2022

303

155

19 Jun 2022

What is Where by Looking: Weakly-Supervised Open-World Phrase-Grounding without Text Inputs

Neural Information Processing Systems (NeurIPS), 2022

Tal Shaharabany

Yoad Tewel

Lior Wolf

ObjD

255

19 Jun 2022

VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix

International Conference on Machine Learning (ICML), 2022

Ran Cheng

Ping Luo

209

17 Jun 2022

MixGen: A New Multi-Modal Data Augmentation

399

122

16 Jun 2022

RefCrowd: Grounding the Target in Crowd with Referring Expressions

ACM Multimedia (ACM MM), 2022

Qingbo Wu

Fanman Meng

ObjD

218

16 Jun 2022

Image Captioning based on Feature Refinement and Reflective Decoding

157

16 Jun 2022

Multimodal Dialogue State Tracking

North American Chapter of the Association for Computational Linguistics (NAACL), 2022

Hung Le

Nancy F. Chen

Guosheng Lin

160

16 Jun 2022

Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone

Neural Information Processing Systems (NeurIPS), 2022

...

296

152

15 Jun 2022

TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022

Wanli Ouyang

242

14 Jun 2022

GLIPv2: Unifying Localization and Vision-Language Understanding

Lu Yuan

296

354

12 Jun 2022

A Unified Continuous Learning Framework for Multi-modal Knowledge Discovery and Pre-training

Xuanjing Huang

155

11 Jun 2022

Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs

Neural Information Processing Systems (NeurIPS), 2022

310

09 Jun 2022

VL-BEiT: Generative Vision-Language Pretraining

180

02 Jun 2022

VALHALLA: Visual Hallucination for Machine Translation

Computer Vision and Pattern Recognition (CVPR), 2022

458

31 May 2022

VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models

308

30 May 2022

CyCLIP: Cyclic Contrastive Language-Image Pretraining

Neural Information Processing Systems (NeurIPS), 2022

522

166

28 May 2022

HiVLP: Hierarchical Vision-Language Pre-Training for Fast Image-Text Retrieval

227

24 May 2022

mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022

...

Ji Zhang

Jingren Zhou

281

270

24 May 2022

Charon: a FrameNet Annotation Tool for Multimodal Corpora

Law (LAW), 2022

102

24 May 2022

PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022

Yuan Yao

Qi-An Chen

Ao Zhang

Wei Ji

Zhiyuan Liu

Tat-Seng Chua

Maosong Sun

VLM MLLM

256

23 May 2022