ResearchTrend.AI
  • Communities
  • Connect sessions
  • AI calendar
  • Organizations
  • Join Slack
  • Contact Sales
Papers
Communities
Social Events
Terms and Conditions
Pricing
Contact Sales
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2026 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 1505.04870
  4. Cited By
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for
  Richer Image-to-Sentence Models
v1v2v3v4 (latest)

Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models

19 May 2015
Bryan A. Plummer
Liwei Wang
Christopher M. Cervantes
Juan C. Caicedo
Anjali Narayan-Chen
Svetlana Lazebnik
ArXiv (abs)PDFHTML

Papers citing "Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models"

50 / 1,325 papers shown
LAVIS: A Library for Language-Vision Intelligence
LAVIS: A Library for Language-Vision Intelligence
Dongxu Li
Junnan Li
Hung Le
Guangsen Wang
Silvio Savarese
Guosheng Lin
VLM
337
63
0
15 Sep 2022
OmniVL:One Foundation Model for Image-Language and Video-Language Tasks
OmniVL:One Foundation Model for Image-Language and Video-Language TasksNeural Information Processing Systems (NeurIPS), 2022
Junke Wang
Dongdong Chen
Zuxuan Wu
Chong Luo
Luowei Zhou
Yucheng Zhao
Yujia Xie
Ce Liu
Yu-Gang Jiang
Lu Yuan
MLLMVLM
294
178
0
15 Sep 2022
OmDet: Large-scale vision-language multi-dataset pre-training with
  multimodal detection network
OmDet: Large-scale vision-language multi-dataset pre-training with multimodal detection networkIET Computer Vision (ICV), 2022
Tiancheng Zhao
Peng Liu
Kyusong Lee
VLMMLLMObjD
157
16
0
10 Sep 2022
Foundations and Trends in Multimodal Machine Learning: Principles,
  Challenges, and Open Questions
Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open QuestionsACM Computing Surveys (ACM CSUR), 2022
Paul Pu Liang
Amir Zadeh
Louis-Philippe Morency
315
169
0
07 Sep 2022
Statistical Foundation Behind Machine Learning and Its Impact on
  Computer Vision
Statistical Foundation Behind Machine Learning and Its Impact on Computer Vision
Lei Zhang
H. Shum
VLMSSL
144
2
0
06 Sep 2022
Design of the topology for contrastive visual-textual alignment
Design of the topology for contrastive visual-textual alignment
Zhun Sun
376
2
0
05 Sep 2022
RLIP: Relational Language-Image Pre-training for Human-Object
  Interaction Detection
RLIP: Relational Language-Image Pre-training for Human-Object Interaction DetectionNeural Information Processing Systems (NeurIPS), 2022
Hangjie Yuan
Jianwen Jiang
Samuel Albanie
Tao Feng
Ziyuan Huang
Dong Ni
Mingqian Tang
VLM
374
76
0
05 Sep 2022
Efficient Vision-Language Pretraining with Visual Concepts and
  Hierarchical Alignment
Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical AlignmentBritish Machine Vision Conference (BMVC), 2022
Mustafa Shukor
Guillaume Couairon
Matthieu Cord
VLMCLIP
311
28
0
29 Aug 2022
MuMUR : Multilingual Multimodal Universal Retrieval
MuMUR : Multilingual Multimodal Universal Retrieval
Avinash Madasu
Estelle Aflalo
Gabriela Ben-Melech Stan
Shachar Rosenman
Shao-Yen Tseng
Gedas Bertasius
Vasudev Lal
430
6
0
24 Aug 2022
Learning More May Not Be Better: Knowledge Transferability in Vision and
  Language Tasks
Learning More May Not Be Better: Knowledge Transferability in Vision and Language TasksJournal of Imaging (JI), 2022
Tianwei Chen
Noa Garcia
Mayu Otani
Chenhui Chu
Yuta Nakashima
Hajime Nagahara
VLM
139
1
0
23 Aug 2022
Image as a Foreign Language: BEiT Pretraining for All Vision and
  Vision-Language Tasks
Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks
Wenhui Wang
Hangbo Bao
Li Dong
Johan Bjorck
Zhiliang Peng
...
Kriti Aggarwal
O. Mohammed
Saksham Singhal
Subhojit Som
Furu Wei
MLLMVLMViT
640
707
0
22 Aug 2022
CODER: Coupled Diversity-Sensitive Momentum Contrastive Learning for
  Image-Text Retrieval
CODER: Coupled Diversity-Sensitive Momentum Contrastive Learning for Image-Text RetrievalEuropean Conference on Computer Vision (ECCV), 2022
Haoran Wang
Dongliang He
Wenhao Wu
Boyang Xia
Min Yang
Fu Li
YunLong Yu
Zhong Ji
Errui Ding
Jingdong Wang
199
27
0
21 Aug 2022
VLMAE: Vision-Language Masked Autoencoder
VLMAE: Vision-Language Masked Autoencoder
Su He
Taian Guo
Tao Dai
Ruizhi Qiao
Chen Wu
Xiujun Shu
Bohan Ren
VLM
205
11
0
19 Aug 2022
Multimodal foundation models are better simulators of the human brain
Multimodal foundation models are better simulators of the human brain
Haoyu Lu
Qiongyi Zhou
Nanyi Fei
Zhiwu Lu
Mingyu Ding
...
Changde Du
Xin Zhao
Haoran Sun
Huiguang He
J. Wen
AI4CE
183
19
0
17 Aug 2022
GRIT-VLP: Grouped Mini-batch Sampling for Efficient Vision and Language
  Pre-training
GRIT-VLP: Grouped Mini-batch Sampling for Efficient Vision and Language Pre-trainingEuropean Conference on Computer Vision (ECCV), 2022
Jaeseok Byun
Taebaek Hwang
Jianlong Fu
Taesup Moon
VLM
212
13
0
08 Aug 2022
Fine-Grained Semantically Aligned Vision-Language Pre-Training
Fine-Grained Semantically Aligned Vision-Language Pre-TrainingNeural Information Processing Systems (NeurIPS), 2022
Juncheng Li
Xin He
Longhui Wei
Long Qian
Linchao Zhu
Lingxi Xie
Yueting Zhuang
Qi Tian
Siliang Tang
VLM
209
100
0
04 Aug 2022
Masked Vision and Language Modeling for Multi-modal Representation
  Learning
Masked Vision and Language Modeling for Multi-modal Representation LearningInternational Conference on Learning Representations (ICLR), 2022
Gukyeong Kwon
Zhaowei Cai
Avinash Ravichandran
Erhan Bas
Rahul Bhotika
Stefano Soatto
257
84
0
03 Aug 2022
Augmenting Vision Language Pretraining by Learning Codebook with Visual
  Semantics
Augmenting Vision Language Pretraining by Learning Codebook with Visual SemanticsInternational Conference on Pattern Recognition (ICPR), 2022
Xiaoyuan Guo
Jiali Duan
C.-C. Jay Kuo
J. Gichoya
Imon Banerjee
VLM
186
1
0
31 Jul 2022
Curriculum Learning for Data-Efficient Vision-Language Alignment
Curriculum Learning for Data-Efficient Vision-Language Alignment
Tejas Srinivasan
Xiang Ren
Jesse Thomason
VLM
156
11
0
29 Jul 2022
Learning Visual Representation from Modality-Shared Contrastive
  Language-Image Pre-training
Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-trainingEuropean Conference on Computer Vision (ECCV), 2022
Haoxuan You
Luowei Zhou
Bin Xiao
Noel Codella
Yu Cheng
Ruochen Xu
Shih-Fu Chang
Lu Yuan
CLIPVLM
225
56
0
26 Jul 2022
WinoGAViL: Gamified Association Benchmark to Challenge
  Vision-and-Language Models
WinoGAViL: Gamified Association Benchmark to Challenge Vision-and-Language ModelsNeural Information Processing Systems (NeurIPS), 2022
Yonatan Bitton
Nitzan Bitton-Guetta
Ron Yosef
Yuval Elovici
Joey Tianyi Zhou
Gabriel Stanovsky
Roy Schwartz
219
19
0
25 Jul 2022
Chunk-aware Alignment and Lexical Constraint for Visual Entailment with
  Natural Language Explanations
Chunk-aware Alignment and Lexical Constraint for Visual Entailment with Natural Language ExplanationsACM Multimedia (ACM MM), 2022
Qian Yang
Yunxin Li
Baotian Hu
Lin Ma
Yuxin Ding
Min Zhang
240
11
0
23 Jul 2022
Rethinking the Reference-based Distinctive Image Captioning
Rethinking the Reference-based Distinctive Image CaptioningACM Multimedia (ACM MM), 2022
Yangjun Mao
Long Chen
Zhihong Jiang
Dong Zhang
Zhimeng Zhang
Jian Shao
Jun Xiao
DiffM
228
23
0
22 Jul 2022
Don't Stop Learning: Towards Continual Learning for the CLIP Model
Don't Stop Learning: Towards Continual Learning for the CLIP Model
Yuxuan Ding
Lingqiao Liu
Chunna Tian
Jingyuan Yang
Haoxuan Ding
CLLVLMKELM
226
72
0
19 Jul 2022
Entity-enhanced Adaptive Reconstruction Network for Weakly Supervised
  Referring Expression Grounding
Entity-enhanced Adaptive Reconstruction Network for Weakly Supervised Referring Expression GroundingIEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022
Xuejing Liu
Liang Li
Shuhui Wang
Zhengjun Zha
Dechao Meng
Qi Tian
Qingming Huang
251
74
0
18 Jul 2022
FashionViL: Fashion-Focused Vision-and-Language Representation Learning
FashionViL: Fashion-Focused Vision-and-Language Representation LearningEuropean Conference on Computer Vision (ECCV), 2022
Xiaoping Han
Licheng Yu
Xiatian Zhu
Li Zhang
Yi-Zhe Song
Tao Xiang
AI4TS
192
60
0
17 Jul 2022
LineCap: Line Charts for Data Visualization Captioning Models
LineCap: Line Charts for Data Visualization Captioning ModelsVisual .. (VISUAL), 2022
Anita Mahinpei
Zona Kostic
Christy Tanner
VLM
196
25
0
15 Jul 2022
Toward Explainable and Fine-Grained 3D Grounding through Referring
  Textual Phrases
Toward Explainable and Fine-Grained 3D Grounding through Referring Textual Phrases
Zhihao Yuan
Xu Yan
Zhuo Li
Xuhao Li
Yao Guo
Shuguang Cui
Zhen Li
194
18
0
05 Jul 2022
Contrastive Cross-Modal Knowledge Sharing Pre-training for
  Vision-Language Representation Learning and Retrieval
Contrastive Cross-Modal Knowledge Sharing Pre-training for Vision-Language Representation Learning and Retrieval
Keyu Wen
Zhenshan Tan
Qingrong Cheng
Cheng Chen
X. Gu
VLM
198
1
0
02 Jul 2022
Improving Visual Grounding by Encouraging Consistent Gradient-based
  Explanations
Improving Visual Grounding by Encouraging Consistent Gradient-based ExplanationsComputer Vision and Pattern Recognition (CVPR), 2022
Ziyan Yang
Kushal Kafle
Franck Dernoncourt
Vicente Ordónez Román
VLM
422
32
0
30 Jun 2022
Towards Adversarial Attack on Vision-Language Pre-training Models
Towards Adversarial Attack on Vision-Language Pre-training ModelsACM Multimedia (ACM MM), 2022
Jiaming Zhang
Qiaomin Yi
Jitao Sang
VLMAAML
303
155
0
19 Jun 2022
What is Where by Looking: Weakly-Supervised Open-World Phrase-Grounding
  without Text Inputs
What is Where by Looking: Weakly-Supervised Open-World Phrase-Grounding without Text InputsNeural Information Processing Systems (NeurIPS), 2022
Tal Shaharabany
Yoad Tewel
Lior Wolf
ObjD
255
23
0
19 Jun 2022
VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix
VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMixInternational Conference on Machine Learning (ICML), 2022
Teng Wang
Wenhao Jiang
Zhichao Lu
Feng Zheng
Ran Cheng
Chengguo Yin
Ping Luo
VLM
209
54
0
17 Jun 2022
MixGen: A New Multi-Modal Data Augmentation
MixGen: A New Multi-Modal Data Augmentation
Xiaoshuai Hao
Yi Zhu
Srikar Appalaraju
Aston Zhang
Wanqian Zhang
Boyang Li
Mu Li
VLM
399
122
0
16 Jun 2022
RefCrowd: Grounding the Target in Crowd with Referring Expressions
RefCrowd: Grounding the Target in Crowd with Referring ExpressionsACM Multimedia (ACM MM), 2022
Heqian Qiu
Hongliang Li
Taijin Zhao
Lanxiao Wang
Qingbo Wu
Fanman Meng
ObjD
218
9
0
16 Jun 2022
Image Captioning based on Feature Refinement and Reflective Decoding
Image Captioning based on Feature Refinement and Reflective Decoding
G. Alabduljabbar
Hafida Benhidour
Said Kerrache
3DV
157
3
0
16 Jun 2022
Multimodal Dialogue State Tracking
Multimodal Dialogue State TrackingNorth American Chapter of the Association for Computational Linguistics (NAACL), 2022
Hung Le
Nancy F. Chen
Guosheng Lin
160
10
0
16 Jun 2022
Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone
Coarse-to-Fine Vision-Language Pre-training with Fusion in the BackboneNeural Information Processing Systems (NeurIPS), 2022
Zi-Yi Dou
Aishwarya Kamath
Zhe Gan
Pengchuan Zhang
Jianfeng Wang
...
Ce Liu
Yann LeCun
Nanyun Peng
Jianfeng Gao
Lijuan Wang
VLMObjD
296
152
0
15 Jun 2022
TransVG++: End-to-End Visual Grounding with Language Conditioned Vision
  Transformer
TransVG++: End-to-End Visual Grounding with Language Conditioned Vision TransformerIEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022
Jiajun Deng
Zhengyuan Yang
Daqing Liu
Tianlang Chen
Wen-gang Zhou
Yanyong Zhang
Houqiang Li
Wanli Ouyang
ViT
242
90
0
14 Jun 2022
GLIPv2: Unifying Localization and Vision-Language Understanding
GLIPv2: Unifying Localization and Vision-Language Understanding
Haotian Zhang
Pengchuan Zhang
Xiaowei Hu
Yen-Chun Chen
Liunian Harold Li
Xiyang Dai
Lijuan Wang
Lu Yuan
Lei Li
Jianfeng Gao
ObjDVLM
296
354
0
12 Jun 2022
A Unified Continuous Learning Framework for Multi-modal Knowledge
  Discovery and Pre-training
A Unified Continuous Learning Framework for Multi-modal Knowledge Discovery and Pre-training
Zhihao Fan
Zhongyu Wei
Jingjing Chen
Siyuan Wang
Zejun Li
Jiarong Xu
Xuanjing Huang
CLL
155
6
0
11 Jun 2022
Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional
  MoEs
Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEsNeural Information Processing Systems (NeurIPS), 2022
Jinguo Zhu
Xizhou Zhu
Wenhai Wang
Xiaohua Wang
Jiaming Song
Xiaogang Wang
Jifeng Dai
MoMeMoE
310
84
0
09 Jun 2022
VL-BEiT: Generative Vision-Language Pretraining
VL-BEiT: Generative Vision-Language Pretraining
Hangbo Bao
Wenhui Wang
Li Dong
Furu Wei
VLM
180
48
0
02 Jun 2022
VALHALLA: Visual Hallucination for Machine Translation
VALHALLA: Visual Hallucination for Machine TranslationComputer Vision and Pattern Recognition (CVPR), 2022
Yi Li
Yikang Shen
Yoon Kim
Chun-Fu Chen
Rogerio Feris
David D. Cox
Nuno Vasconcelos
MLLM
458
51
0
31 May 2022
VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models
VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models
Wangchunshu Zhou
Yan Zeng
Shizhe Diao
Xinsong Zhang
CoGeVLM
308
14
0
30 May 2022
CyCLIP: Cyclic Contrastive Language-Image Pretraining
CyCLIP: Cyclic Contrastive Language-Image PretrainingNeural Information Processing Systems (NeurIPS), 2022
Shashank Goel
Hritik Bansal
S. Bhatia
Ryan Rossi
Vishwa Vinay
Aditya Grover
CLIPVLM
522
166
0
28 May 2022
HiVLP: Hierarchical Vision-Language Pre-Training for Fast Image-Text
  Retrieval
HiVLP: Hierarchical Vision-Language Pre-Training for Fast Image-Text Retrieval
Feilong Chen
Xiuyi Chen
Jiaxin Shi
Duzhen Zhang
Jianlong Chang
Qi Tian
VLMCLIP
227
7
0
24 May 2022
mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal
  Skip-connections
mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connectionsConference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Chenliang Li
Haiyang Xu
Junfeng Tian
Wei Wang
Ming Yan
...
Ji Zhang
Songfang Huang
Feiran Huang
Jingren Zhou
Luo Si
VLMMLLM
281
270
0
24 May 2022
Charon: a FrameNet Annotation Tool for Multimodal Corpora
Charon: a FrameNet Annotation Tool for Multimodal CorporaLaw (LAW), 2022
Frederico Belcavello
Marcelo Viridiano
E. Matos
Haiyue Song
102
6
0
24 May 2022
PEVL: Position-enhanced Pre-training and Prompt Tuning for
  Vision-language Models
PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language ModelsConference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Yuan Yao
Qi-An Chen
Ao Zhang
Wei Ji
Zhiyuan Liu
Tat-Seng Chua
Maosong Sun
VLMMLLM
256
43
0
23 May 2022
Previous
123...171819...252627
Next
Page 18 of 27
Pageof 27