Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models

19 May 2015
Bryan A. Plummer
Liwei Wang
Christopher M. Cervantes
Juan C. Caicedo
Anjali Narayan-Chen
Svetlana Lazebnik

Papers citing "Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models"

50 / 1,325 papers shown
The Case for Perspective in Multimodal Datasets
Marcelo Viridiano
Haiyue Song
Oliver Czulo
Arthur Lorenzi
E. Matos
Frederico Belcavello
118
7
0
22 May 2022
Training Vision-Language Transformers from Captions
Liangke Gui
Yingshan Chang
Qiuyuan Huang
Subhojit Som
Alexander G. Hauptmann
Jianfeng Gao
Yonatan Bisk
VLM, ViT
425
11
0
19 May 2022
Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning
Computer Vision and Pattern Recognition (CVPR), 2022
Chia-Wen Kuo
Z. Kira
266
81
0
09 May 2022
RoViST: Learning Robust Metrics for Visual Storytelling
Eileen Wang
S. Han
Josiah Poon
166
13
0
08 May 2022
Language Models Can See: Plugging Visual Controls in Text Generation
Yixuan Su
Tian Lan
Yahui Liu
Fangyu Liu
Dani Yogatama
Yan Wang
Lingpeng Kong
Nigel Collier
VLM, MLLM
274
111
0
05 May 2022
CoCa: Contrastive Captioners are Image-Text Foundation Models
Jiahui Yu
Zirui Wang
Vijay Vasudevan
Legg Yeung
Mojtaba Seyedhosseini
Yonghui Wu
VLM, CLIP, OffRL
708
1,616
0
04 May 2022
Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering
A. Piergiovanni
Wei Li
Weicheng Kuo
M. Saffar
Fred Bertsch
A. Angelova
281
18
0
02 May 2022
Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning
Computer Vision and Pattern Recognition (CVPR), 2022
Li Yang
Yan Xu
Chunfen Yuan
Wei Liu
Bing Li
Weiming Hu
ObjD
293
155
0
30 Apr 2022
Leaner and Faster: Two-Stage Model Compression for Lightweight Text-Image Retrieval
North American Chapter of the Association for Computational Linguistics (NAACL), 2022
Siyu Ren
Kenny Q. Zhu
VLM
102
9
0
29 Apr 2022
Self-paced Multi-grained Cross-modal Interaction Modeling for Referring Expression Comprehension
IEEE Transactions on Image Processing (IEEE TIP), 2022
Peihan Miao
Wei Su
Gaoang Wang
Xuewei Li
Xi Li
ObjD
339
13
0
21 Apr 2022
Making the Most of Text Semantics to Improve Biomedical Vision--Language Processing
European Conference on Computer Vision (ECCV), 2022
Benedikt Boecking
Naoto Usuyama
Shruthi Bannur
Daniel Coelho De Castro
Anton Schwaighofer
...
Tristan Naumann
A. Nori
Javier Alvarez-Valle
Hoifung Poon
Ozan Oktay
496
368
0
21 Apr 2022
Transformer Decoders with MultiModal Regularization for Cross-Modal Food Retrieval
Mustafa Shukor
Guillaume Couairon
Asya Grechka
Matthieu Cord
ViT
204
23
0
20 Apr 2022
ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models
Neural Information Processing Systems (NeurIPS), 2022
Chunyuan Li
Haotian Liu
Liunian Harold Li
Pengchuan Zhang
J. Aneja
...
Ping Jin
Houdong Hu
Zicheng Liu
Yong Jae Lee
Jianfeng Gao
297
177
0
19 Apr 2022
COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval
Computer Vision and Pattern Recognition (CVPR), 2022
Haoyu Lu
Nanyi Fei
Yuqi Huo
Yizhao Gao
Zhiwu Lu
Jiaxin Wen
CLIP, VLM
262
55
0
15 Apr 2022
Brainish: Formalizing A Multimodal Language for Intelligence and Consciousness
Paul Pu Liang
359
7
0
14 Apr 2022
X-DETR: A Versatile Architecture for Instance-wise Vision-Language Tasks
European Conference on Computer Vision (ECCV), 2022
Zhaowei Cai
Gukyeong Kwon
Avinash Ravichandran
Erhan Bas
Zhuowen Tu
Rahul Bhotika
Stefano Soatto
ObjD, MLLM, VLM
162
52
0
12 Apr 2022
Adapting CLIP For Phrase Localization Without Further Training
Jiahao Li
G. Shakhnarovich
Raymond A. Yeh
VLM, CLIP
222
26
0
07 Apr 2022
ECCV Caption: Correcting False Negatives by Collecting Machine-and-Human-verified Image-Caption Associations for MS-COCO
European Conference on Computer Vision (ECCV), 2022
Sanghyuk Chun
Wonjae Kim
Song Park
Minsuk Chang
Seong Joon Oh
VLM
1.5K
51
0
07 Apr 2022
Multi-View Transformer for 3D Visual Grounding
Computer Vision and Pattern Recognition (CVPR), 2022
Shijia Huang
Yilun Chen
Jiaya Jia
Liwei Wang
398
177
0
05 Apr 2022
Learning Commonsense-aware Moment-Text Alignment for Fast Video Temporal Grounding
Ziyue Wu
Junyu Gao
Shucheng Huang
Changsheng Xu
241
6
0
04 Apr 2022
FindIt: Generalized Localization with Natural Language Queries
European Conference on Computer Vision (ECCV), 2022
Weicheng Kuo
Fred Bertsch
Wei Li
A. Piergiovanni
M. Saffar
A. Angelova
ObjD
215
18
0
31 Mar 2022
TubeDETR: Spatio-Temporal Video Grounding with Transformers
Computer Vision and Pattern Recognition (CVPR), 2022
Antoine Yang
Antoine Miech
Josef Sivic
Ivan Laptev
Cordelia Schmid
ViT
341
121
0
30 Mar 2022
Shifting More Attention to Visual Backbone: Query-modulated Refinement Networks for End-to-End Visual Grounding
Computer Vision and Pattern Recognition (CVPR), 2022
Jiabo Ye
Junfeng Tian
Ming Yan
Xiaoshan Yang
Xuwu Wang
Ji Zhang
Liang He
Xin Lin
ObjD
234
93
0
29 Mar 2022
Large-scale Bilingual Language-Image Contrastive Learning
ByungSoo Ko
Geonmo Gu
VLM
285
17
0
28 Mar 2022
Single-Stream Multi-Level Alignment for Vision-Language Pretraining
European Conference on Computer Vision (ECCV), 2022
Zaid Khan
B. Vijaykumar
Xiang Yu
S. Schulter
Manmohan Chandraker
Y. Fu
CLIP, VLM
357
22
0
27 Mar 2022
Knowledge Mining with Scene Text for Fine-Grained Recognition
Computer Vision and Pattern Recognition (CVPR), 2022
Hao Wang
Junchao Liao
Tianheng Cheng
Zewen Gao
Hao Liu
Bo Ren
X. Bai
Wenyu Liu
208
14
0
27 Mar 2022
Look for the Change: Learning Object States and State-Modifying Actions from Untrimmed Web Videos
Computer Vision and Pattern Recognition (CVPR), 2022
Tomáš Souček
Jean-Baptiste Alayrac
Antoine Miech
Ivan Laptev
Josef Sivic
247
43
0
22 Mar 2022
WuDaoMM: A large-scale Multi-Modal Dataset for Pre-training models
Shan Yuan
Shuai Zhao
Jiahong Leng
Zhao Xue
Hanyu Zhao
Peiyu Liu
Zheng Gong
Wayne Xin Zhao
Junyi Li
Tang Jie
VLM
270
6
0
22 Mar 2022
Finding Structural Knowledge in Multimodal-BERT
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
Victor Milewski
Miryam de Lhoneux
Marie-Francine Moens
220
12
0
17 Mar 2022
Pseudo-Q: Generating Pseudo Language Queries for Visual Grounding
Computer Vision and Pattern Recognition (CVPR), 2022
Haojun Jiang
Yuanze Lin
Dongchen Han
Shiji Song
Gao Huang
ObjD
324
65
0
16 Mar 2022
NLX-GPT: A Model for Natural Language Explanations in Vision and Vision-Language Tasks
Computer Vision and Pattern Recognition (CVPR), 2022
Fawaz Sammani
Tanmoy Mukherjee
Nikos Deligiannis
MILM, ELM, LRM
321
75
0
09 Mar 2022
Geodesic Multi-Modal Mixup for Robust Fine-Tuning
Neural Information Processing Systems (NeurIPS), 2022
Changdae Oh
Junhyuk So
Hoyoon Byun
Yongtaek Lim
Minchul Shin
Jong-June Jeon
Kyungwoo Song
458
39
0
08 Mar 2022
Unpaired Image Captioning by Image-level Weakly-Supervised Visual Concept Recognition
IEEE Transactions on Multimedia (IEEE TMM), 2022
Peipei Zhu
Tianlin Li
Yong Luo
Zhenglong Sun
Wei-Shi Zheng
Yaowei Wang
Chen Chen
220
15
0
07 Mar 2022
FS-COCO: Towards Understanding of Freehand Sketches of Common Objects in Context
European Conference on Computer Vision (ECCV), 2022
Pinaki Nath Chowdhury
Aneeshan Sain
A. Bhunia
Tao Xiang
Yulia Gryaditskaya
Yi-Zhe Song
3DV
337
66
0
04 Mar 2022
Vision-Language Intelligence: Tasks, Representation Learning, and Large Models
Feng Li
Hao Zhang
Yi-Fan Zhang
Shixuan Liu
Jian Guo
L. Ni
Pengchuan Zhang
Lei Zhang
AI4TS, VLM
212
41
0
03 Mar 2022
Multi-modal Alignment using Representation Codebook
Computer Vision and Pattern Recognition (CVPR), 2022
Jiali Duan
Liqun Chen
Son Tran
Jinyu Yang
Yi Xu
Belinda Zeng
Trishul Chilimbi
511
79
0
28 Feb 2022
StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Translation
International Joint Conference on Artificial Intelligence (IJCAI), 2022
Peter Schaldenbrand
Zhixuan Liu
Jean Oh
CLIP
194
50
0
24 Feb 2022
GroupViT: Semantic Segmentation Emerges from Text Supervision
Computer Vision and Pattern Recognition (CVPR), 2022
Jiarui Xu
Shalini De Mello
Sifei Liu
Wonmin Byeon
Thomas Breuel
Jan Kautz
Xinyu Wang
ViT, VLM
765
637
0
22 Feb 2022
Vision-Language Pre-Training with Triple Contrastive Learning
Computer Vision and Pattern Recognition (CVPR), 2022
Jinyu Yang
Jiali Duan
Son N. Tran
Yi Xu
Sampath Chanda
Liqun Chen
Belinda Zeng
Trishul Chilimbi
Junzhou Huang
VLM
575
358
0
21 Feb 2022
On Guiding Visual Attention with Language Specification
Computer Vision and Pattern Recognition (CVPR), 2022
Suzanne Petryk
Lisa Dunlap
Keyan Nasseri
Joseph E. Gonzalez
Trevor Darrell
Anna Rohrbach
VLM
415
39
1
17 Feb 2022
CommerceMM: Large-Scale Commerce MultiModal Representation Learning with Omni Retrieval
Knowledge Discovery and Data Mining (KDD), 2022
Licheng Yu
Jun Chen
Animesh Sinha
Mengjiao MJ Wang
Hugo Chen
Tamara L. Berg
Ning Zhang
VLM
263
44
0
15 Feb 2022
Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark
Neural Information Processing Systems (NeurIPS), 2022
Jiaxi Gu
Xiaojun Meng
Guansong Lu
Lu Hou
Minzhe Niu
...
Runhu Huang
Wei Zhang
Xingda Jiang
Chunjing Xu
Hang Xu
VLM
410
138
0
14 Feb 2022
I-Tuning: Tuning Frozen Language Models with Image for Lightweight Image Captioning
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2022
Ziyang Luo
Zhipeng Hu
Yadong Xi
Rongsheng Zhang
Jing Ma
VLM
187
29
0
14 Feb 2022
Multi-Modal Knowledge Graph Construction and Application: A Survey
IEEE Transactions on Knowledge and Data Engineering (TKDE), 2022
Xiangru Zhu
Zhixu Li
Xiaodan Wang
Xueyao Jiang
Yixiang Chen
Xuwu Wang
Yanghua Xiao
N. Yuan
211
238
0
11 Feb 2022
Bench-Marking And Improving Arabic Automatic Image Captioning Through The Use Of Multi-Task Learning Paradigm
Muhy Eddin Za'ter
Bashar Talafha
VLM
199
6
0
11 Feb 2022
Keyword localisation in untranscribed speech using visually grounded speech models
IEEE Journal on Selected Topics in Signal Processing (IEEE JSTSP), 2022
Kayode Olaleye
Dan Oneaţă
Herman Kamper
206
7
0
02 Feb 2022
Deep Learning Approaches on Image Captioning: A Review
ACM Computing Surveys (ACM CSUR), 2022
Taraneh Ghandi
H. Pourreza
H. Mahyar
VLM
487
155
0
31 Jan 2022
A Frustratingly Simple Approach for End-to-End Image Captioning
Ziyang Luo
Yadong Xi
Rongsheng Zhang
Jing Ma
VLM, MLLM
244
19
0
30 Jan 2022
MVPTR: Multi-Level Semantic Alignment for Vision-Language Pre-Training via Multi-Stage Learning
ACM Multimedia (ACM MM), 2022
Zejun Li
Zhihao Fan
Huaixiao Tou
Jingjing Chen
Zhongyu Wei
Xuanjing Huang
245
23
0
29 Jan 2022
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
International Conference on Machine Learning (ICML), 2022
Junnan Li
Dongxu Li
Caiming Xiong
Guosheng Lin
MLLM, BDL, VLM, CLIP
1.4K
5,888
0
28 Jan 2022