Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models

19 May 2015
Bryan A. Plummer
Liwei Wang
Christopher M. Cervantes
Juan C. Caicedo
Anjali Narayan-Chen
Svetlana Lazebnik

Papers citing "Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models"

50 / 1,325 papers shown
A Survey on Visual Transfer Learning using Knowledge Graphs
Sebastian Monka
Lavdim Halilaj
Achim Rettinger
254
27
0
27 Jan 2022
PARS: Pseudo-Label Aware Robust Sample Selection for Learning with Noisy Labels
A. Goel
Yunlong Jiao
Jordan Massiah
NoLa
170
10
0
26 Jan 2022
Supervised Visual Attention for Simultaneous Multimodal Machine Translation
Journal of Artificial Intelligence Research (JAIR), 2022
Veneta Haralampieva
Ozan Caglayan
Lucia Specia
LRM
223
4
0
23 Jan 2022
Unpaired Referring Expression Grounding via Bidirectional Cross-Modal Matching
Neurocomputing, 2022
Hengcan Shi
Munawar Hayat
Jianfei Cai
ObjD
209
12
0
18 Jan 2022
Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training
Yehao Li
Jiahao Fan
Yingwei Pan
Ting Yao
Weiyao Lin
Tao Mei
MLLM ObjD
222
24
0
11 Jan 2022
Semantically Grounded Visual Embeddings for Zero-Shot Learning
Shah Nawaz
Jacopo Cavazza
Alessio Del Bue
ObjD FedML VLM
285
6
0
03 Jan 2022
Deconfounded Visual Grounding
AAAI Conference on Artificial Intelligence (AAAI), 2021
Jianqiang Huang
Yu Qin
Jiaxin Qi
Qianru Sun
Hanwang Zhang
CML ObjD
203
38
0
31 Dec 2021
Grounding Linguistic Commands to Navigable Regions
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2021
N. Rufus
Kanishk Jain
U. R. Nair
Vineet Gandhi
K. M. Krishna
LM&Ro
214
13
0
24 Dec 2021
Scaling Open-Vocabulary Image Segmentation with Image-Level Labels
European Conference on Computer Vision (ECCV), 2021
Golnaz Ghiasi
Xiuye Gu
Huayu Chen
Nayeon Lee
VLM
444
497
0
22 Dec 2021
A Survey of Natural Language Generation
ACM Computing Surveys (CSUR), 2021
Chenhe Dong
Hai-Tao Zheng
Haifan Gong
Mengzhao Chen
Junxin Li
Ying Shen
Min Yang
3DV
336
65
0
22 Dec 2021
ScanQA: 3D Question Answering for Spatial Scene Understanding
Computer Vision and Pattern Recognition (CVPR), 2021
Daichi Azuma
Taiki Miyanishi
Shuhei Kurita
M. Kawanabe
444
328
0
20 Dec 2021
Bottom Up Top Down Detection Transformers for Language Grounding in Images and Point Clouds
Ayush Jain
N. Gkanatsios
Ishita Mediratta
Katerina Fragkiadaki
ObjD
492
148
0
16 Dec 2021
Distilled Dual-Encoder Model for Vision-Language Understanding
Zekun Wang
Wenhui Wang
Haichao Zhu
Ming Liu
Bing Qin
Furu Wei
VLM FedML
214
35
0
16 Dec 2021
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena
Letitia Parcalabescu
Michele Cafagna
Lilitta Muradjan
Anette Frank
Iacer Calixto
Albert Gatt
CoGe
303
137
0
14 Dec 2021
Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation
Tianyi Liu
Zuxuan Wu
Wenhan Xiong
Yue Yu
Yu-Gang Jiang
VLM MLLM
221
11
0
10 Dec 2021
FLAVA: A Foundational Language And Vision Alignment Model
Amanpreet Singh
Ronghang Hu
Vedanuj Goswami
Guillaume Couairon
Wojciech Galuba
Marcus Rohrbach
Douwe Kiela
CLIP VLM
383
873
0
08 Dec 2021
MLP Architectures for Vision-and-Language Modeling: An Empirical Study
Yi-Liang Nie
Linjie Li
Zhe Gan
Shuohang Wang
Chenguang Zhu
Michael Zeng
Zicheng Liu
Joey Tianyi Zhou
Lijuan Wang
167
9
0
08 Dec 2021
Grounded Language-Image Pre-training
Liunian Harold Li
Pengchuan Zhang
Haotian Zhang
Jianwei Yang
Chunyuan Li
...
Lu Yuan
Lei Zhang
Lei Li
Kai-Wei Chang
Jianfeng Gao
ObjD VLM
468
1,407
0
07 Dec 2021
From Coarse to Fine-grained Concept based Discrimination for Phrase Detection
Maan Qraitem
Bryan A. Plummer
ObjD
200
0
0
06 Dec 2021
D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding
Dave Zhenyu Chen
Qirui Wu
Matthias Nießner
Angel X. Chang
196
51
0
02 Dec 2021
Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks
Xizhou Zhu
Jinguo Zhu
Hao Li
Xiaoshi Wu
Xiaogang Wang
Jiaming Song
Xiaohua Wang
Jifeng Dai
252
152
0
02 Dec 2021
Weakly-Supervised Video Object Grounding via Causal Intervention
Wei Wang
Junyu Gao
Changsheng Xu
CML
325
32
0
01 Dec 2021
UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling
Zhengyuan Yang
Zhe Gan
Jianfeng Wang
Xiaowei Hu
Faisal Ahmed
Zicheng Liu
Yumao Lu
Lijuan Wang
357
134
0
23 Nov 2021
Florence: A New Foundation Model for Computer Vision
Lu Yuan
Dongdong Chen
Yi-Ling Chen
Noel Codella
Xiyang Dai
...
Zhen Xiao
Jianwei Yang
Michael Zeng
Luowei Zhou
Pengchuan Zhang
VLM
409
1,060
0
22 Nov 2021
Class-agnostic Object Detection with Multi-modal Transformer
European Conference on Computer Vision (ECCV), 2021
Muhammad Maaz
H. Rasheed
Salman Khan
Fahad Shahbaz Khan
Rao Muhammad Anwer
Ming-Hsuan Yang
625
117
0
22 Nov 2021
Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts
International Conference on Machine Learning (ICML), 2021
Yan Zeng
Xinsong Zhang
Hang Li
VLM CLIP
345
356
0
16 Nov 2021
Memotion Analysis through the Lens of Joint Embedding
AAAI Conference on Artificial Intelligence (AAAI), 2021
Nethra Gunti
Sathyanarayanan Ramamoorthy
Parth Patwa
Amitava Das
130
8
0
13 Nov 2021
FILIP: Fine-grained Interactive Language-Image Pre-Training
International Conference on Learning Representations (ICLR), 2021
Lewei Yao
Runhu Huang
Lu Hou
Guansong Lu
Minzhe Niu
Hang Xu
Xiaodan Liang
Zhenguo Li
Xin Jiang
Chunjing Xu
VLM CLIP
343
769
0
09 Nov 2021
Negative Sample is Negative in Its Own Way: Tailoring Negative Sentences for Image-Text Retrieval
Zhihao Fan
Zhongyu Wei
Zejun Li
Siyuan Wang
Jianqing Fan
186
7
0
05 Nov 2021
An Empirical Study of Training End-to-End Vision-and-Language Transformers
Computer Vision and Pattern Recognition (CVPR), 2021
Zi-Yi Dou
Yichong Xu
Zhe Gan
Jianfeng Wang
Shuohang Wang
...
Pengchuan Zhang
Lu Yuan
Nanyun Peng
Zicheng Liu
Michael Zeng
VLM
302
438
0
03 Nov 2021
VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts
Neural Information Processing Systems (NeurIPS), 2021
Hangbo Bao
Wenhui Wang
Li Dong
Qiang Liu
Owais Khan Mohammed
Kriti Aggarwal
Subhojit Som
Furu Wei
VLM MLLM MoE
981
693
0
03 Nov 2021
Bangla Image Caption Generation through CNN-Transformer based Encoder-Decoder Network
Yuansan Liu
MD Abdullah Al Nasim
Sourav Saha
Faria Afrin
Raisa Mallik
Sathishkumar Samiappan
ViT
152
16
0
24 Oct 2021
Text-Based Person Search with Limited Data
British Machine Vision Conference (BMVC), 2021
Xiaoping Han
Sen He
Li Zhang
Tao Xiang
210
125
0
20 Oct 2021
Look at What I'm Doing: Self-Supervised Spatial Grounding of Narrations in Instructional Videos
Neural Information Processing Systems (NeurIPS), 2021
Reuben Tan
Bryan A. Plummer
Kate Saenko
Hailin Jin
Bryan C. Russell
SSL
213
28
0
20 Oct 2021
VLDeformer: Vision-Language Decomposed Transformer for Fast Cross-Modal Retrieval
Knowledge-Based Systems (KBS), 2021
Lisai Zhang
Hongfa Wu
Qingcai Chen
Yimeng Deng
Zhonghua Li
Dejiang Kong
Bo Zhao
Joanna Siebert
Yunpeng Han
ViT VLM
220
24
0
20 Oct 2021
Towards Language-guided Visual Recognition via Dynamic Convolutions
Gen Luo
Weihao Ye
Xiaoshuai Sun
Yongjian Wu
Yue Gao
Rongrong Ji
ObjD
242
27
0
17 Oct 2021
Unsupervised Natural Language Inference Using PHL Triplet Generation
Neeraj Varshney
Pratyay Banerjee
Tejas Gokhale
Chitta Baral
261
10
0
16 Oct 2021
Is An Image Worth Five Sentences? A New Look into Semantics for Image-Text Matching
Ali Furkan Biten
Andrés Mafla
Lluís Gómez
Dimosthenis Karatzas
459
20
0
06 Oct 2021
Learning Structural Representations for Recipe Generation and Food Retrieval
Hao Wang
Guosheng Lin
Steven C. H. Hoi
Chunyan Miao
157
38
0
04 Oct 2021
CIDEr-R: Robust Consensus-based Image Description Evaluation
G. O. D. Santos
Esther Luna Colombini
Sandra Avila
161
42
0
28 Sep 2021
CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models
Yuan Yao
Ao Zhang
Zhengyan Zhang
Zhiyuan Liu
Tat-Seng Chua
Maosong Sun
MLLM VPVLM VLM
594
245
0
24 Sep 2021
Discovering and Validating AI Errors With Crowdsourced Failure Reports
Ángel Alexander Cabrera
Abraham J. Druck
Jason I. Hong
Adam Perer
HAI
181
63
0
23 Sep 2021
KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation
Yongfei Liu
Chenfei Wu
Shao-Yen Tseng
Vasudev Lal
Xuming He
Nan Duan
CLIP VLM
283
32
0
22 Sep 2021
Associative Memories via Predictive Coding
Tommaso Salvatori
Yuhang Song
Yujian Hong
Simon Frieder
Lei Sha
Zhenghua Xu
Rafal Bogacz
Thomas Lukasiewicz
196
78
0
16 Sep 2021
Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning
Da Yin
Liunian Harold Li
Ziniu Hu
Nanyun Peng
Kai-Wei Chang
300
65
0
14 Sep 2021
xGQA: Cross-Lingual Visual Question Answering
Jonas Pfeiffer
Gregor Geigle
Aishwarya Kamath
Jan-Martin O. Steitz
Stefan Roth
Ivan Vulić
Iryna Gurevych
362
80
0
13 Sep 2021
DSSL: Deep Surroundings-person Separation Learning for Text-based Person Retrieval
A. Zhu
Zijie Wang
Yifeng Li
Xili Wan
Jing Jin
Tian Wang
Fangqiang Hu
G. Hua
267
253
0
12 Sep 2021
Constructing Phrase-level Semantic Labels to Form Multi-Grained Supervision for Image-Text Retrieval
Zhihao Fan
Zhongyu Wei
Zejun Li
Siyuan Wang
Haijun Shan
Xuanjing Huang
Jianqing Fan
CLIP
118
12
0
12 Sep 2021
Panoptic Narrative Grounding
IEEE International Conference on Computer Vision (ICCV), 2021
Cristina González
Nicolás Ayobi
Isabela Hernández
José Hernández
Jordi Pont-Tuset
Pablo Arbeláez
258
28
0
10 Sep 2021
Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021
Stella Frank
Emanuele Bugliarello
Desmond Elliott
190
95
0
09 Sep 2021