
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models

19 May 2015

Bryan A. Plummer

Liwei Wang

Christopher M. Cervantes

Papers citing "Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models"

50 / 1,325 papers shown

Learning Object-Language Alignments for Open-Vocabulary Object Detection

International Conference on Learning Representations (ICLR), 2022

Jianfei Cai

200

118

27 Nov 2022

CLID: Controlled-Length Image Descriptions with Limited Data

IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2022

Elad Hirsch

A. Tal

VLM 3DV

219

27 Nov 2022

MNER-QG: An End-to-End MRC framework for Multimodal Named Entity Recognition with Query Grounding

AAAI Conference on Artificial Intelligence (AAAI), 2022

204

27 Nov 2022

Who are you referring to? Coreference resolution in image narrations

IEEE International Conference on Computer Vision (ICCV), 2022

272

26 Nov 2022

Look Around and Refer: 2D Synthetic Semantics Knowledge Distillation for 3D Visual Grounding

Neural Information Processing Systems (NeurIPS), 2022

192

25 Nov 2022

Overcoming Catastrophic Forgetting by XAI

Giang Nguyen

229

25 Nov 2022

TPA-Net: Generate A Dataset for Text to Physics-based Animation

Govind Thattai

195

25 Nov 2022

ComCLIP: Training-Free Compositional Image and Text Matching

North American Chapter of the Association for Computational Linguistics (NAACL), 2022

408

25 Nov 2022

Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning

Computer Vision and Pattern Recognition (CVPR), 2022

Wenzhe Zhao

Hongfa Wang

Yujiu Yang

Wei Liu

VLM

267

24 Nov 2022

X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022

Hkust Wangchunshu Zhou

VLM MLLM

243

22 Nov 2022

Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention

IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2022

179

21 Nov 2022

ClipCrop: Conditioned Cropping Driven by Vision-Language Model

Mingxi Cheng

Ji Li

Yoichi Sato

157

21 Nov 2022

Unifying Tracking and Image-Video Object Detection

Rui Wang

Ser-Nam Lim

189

20 Nov 2022

Leveraging per Image-Token Consistency for Vision-Language Pre-training

Computer Vision and Pattern Recognition (CVPR), 2022

194

20 Nov 2022

Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks

Computer Vision and Pattern Recognition (CVPR), 2022

...

Yu Qiao

171

17 Nov 2022

Will Large-scale Generative Models Corrupt Future Datasets?

IEEE International Conference on Computer Vision (ICCV), 2022

Ryuichiro Hataya

Han Bao

Hiromi Arai

245

15 Nov 2022

A Unified Mutual Supervision Framework for Referring Expression Segmentation and Generation

Lei Zhang

185

15 Nov 2022

Zero-shot Image Captioning by Anchor-augmented Vision-Language Space Alignment

Junyan Wang

Yi Zhang

Ming Yan

Ji Zhang

Jitao Sang

VLM

135

14 Nov 2022

Late Fusion with Triplet Margin Objective for Multimodal Ideology Prediction and Analysis

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022

Changyuan Qiu

Winston Wu

Xinliang Frederick Zhang

Lu Wang

151

04 Nov 2022

Text-Only Training for Image Captioning using Noise-Injected CLIP

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022

306

125

01 Nov 2022

Generative Negative Text Replay for Continual Vision-Language Pretraining

European Conference on Computer Vision (ECCV), 2022

Shipeng Yan

Lanqing Hong

Hang Xu

Jianhua Han

Tinne Tuytelaars

Zhenguo Li

Xuming He

VLM CLL CLIP

177

31 Oct 2022

Multilingual Multimodality: A Taxonomical Survey of Datasets, Techniques, Challenges and Opportunities

Khyathi Chandu

A. Geramifard

209

30 Oct 2022

A Survey on Causal Representation Learning and Future Work for Medical Image Analysis

Chang-Tien Lu

OOD BDL CML MedIm

255

28 Oct 2022

Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision

IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2022

157

24 Oct 2022

Towards Unifying Reference Expression Generation and Comprehension

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022

177

24 Oct 2022

Towards Real-Time Text2Video via CLIP-Guided, Pixel-Level Optimization

Peter Schaldenbrand

Zhixuan Liu

Jean Oh

CLIP

198

23 Oct 2022

Extending Phrase Grounding with Pronouns in Visual Dialogues

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022

Min Zhang

194

23 Oct 2022

RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing Data

IEEE Transactions on Geoscience and Remote Sensing (IEEE TGRS), 2022

Yangfan Zhan

Zhitong Xiong

Yuan Yuan

242

188

23 Oct 2022

Learning Point-Language Hierarchical Alignment for 3D Visual Grounding

321

22 Oct 2022

Prophet Attention: Predicting Attention with Future Attention for Image Captioning

Neural Information Processing Systems (NeurIPS), 2022

Xuancheng Ren

Yuexian Zou

234

19 Oct 2022

TOIST: Task Oriented Instance Segmentation Transformer with Noun-Pronoun Distillation

Neural Information Processing Systems (NeurIPS), 2022

Hao Zhao

262

19 Oct 2022

LVP-M3: Language-aware Visual Prompt for Multilingual Multimodal Machine Translation

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022

Hongcheng Guo

Jiaheng Liu

Haoyang Huang

Jian Yang

Zhoujun Li

Dongdong Zhang

Zheng Cui

Furu Wei

190

19 Oct 2022

CPL: Counterfactual Prompt Learning for Vision and Language Models

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022

330

19 Oct 2022

Non-Contrastive Learning Meets Language-Image Pre-Training

Computer Vision and Pattern Recognition (CVPR), 2022

218

17 Oct 2022

Contrastive Language-Image Pre-Training with Knowledge Graphs

Neural Information Processing Systems (NeurIPS), 2022

Gao Huang

191

17 Oct 2022

One does not fit all! On the Complementarity of Vision Encoders for Vision and Language Tasks

Workshop on Representation Learning for NLP (RepL4NLP), 2022

188

12 Oct 2022

MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model

Computer Vision and Pattern Recognition (CVPR), 2022

Junjie Wang

Hongfa Wang

Yujiu Yang

261

11 Oct 2022

Fighting FIRe with FIRE: Assessing the Validity of Text-to-Video Retrieval Benchmarks

Findings (Findings), 2022

226

10 Oct 2022

YFACC: A Yorùbá speech-image dataset for cross-lingual keyword localisation through visual grounding

Spoken Language Technology Workshop (SLT), 2022

226

10 Oct 2022

Distill the Image to Nowhere: Inversion Knowledge Distillation for Multimodal Machine Translation

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022

Ru Peng

Yawen Zeng

Jiaqi Zhao

240

10 Oct 2022

MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning

Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2022

Jing Liu

356

09 Oct 2022

VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment

325

09 Oct 2022

Affection: Learning Affective Explanations for Real-World Visual Data

Computer Vision and Pattern Recognition (CVPR), 2022

183

04 Oct 2022

Enhancing Interpretability and Interactivity in Robot Manipulation: A Neurosymbolic Approach

Georgios Tziafas

Hamidreza Kasaei

LM&Ro

361

03 Oct 2022

ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training

162

30 Sep 2022

MUG: Interactive Multimodal Grounding on User Interfaces

Findings (Findings), 2022

186

29 Sep 2022

Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022

230

28 Sep 2022

UniCLIP: Unified Framework for Contrastive Language-Image Pre-training

Neural Information Processing Systems (NeurIPS), 2022

333

27 Sep 2022

DRAMA: Joint Risk Localization and Captioning in Driving

IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2022

320

154

22 Sep 2022

Exploring Modulated Detection Transformer as a Tool for Action Recognition in Videos

151

21 Sep 2022