Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models

19 May 2015
Bryan A. Plummer
Liwei Wang
Christopher M. Cervantes
Juan C. Caicedo
Anjali Narayan-Chen
Svetlana Lazebnik
arXiv:1505.04870

Papers citing "Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models"

50 / 1,325 papers shown
Boosted Attention: Leveraging Human Attention for Image Captioning
European Conference on Computer Vision (ECCV), 2018
Shi Chen
Qi Zhao
184
49
0
18 Mar 2019
Neural Language Modeling with Visual Features
Antonios Anastasopoulos
Shankar Kumar
H. Liao
VLM
111
26
0
07 Mar 2019
Graphical Contrastive Losses for Scene Graph Parsing
Ji Zhang
Kevin J. Shih
Ahmed Elgammal
Andrew Tao
Bryan Catanzaro
510
247
0
07 Mar 2019
Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos
Dongliang He
Xiang Zhao
Jizhou Huang
Fu Li
Xiao-Chang Liu
Shilei Wen
212
164
0
21 Jan 2019
Evaluating Text-to-Image Matching using Binary Image Selection (BISON)
Hexiang Hu
Ishan Misra
Laurens van der Maaten
171
24
0
19 Jan 2019
A Hierarchical Grocery Store Image Dataset with Visual and Semantic Labels
Marcus Klasson
Cheng Zhang
Hedvig Kjellström
141
52
0
03 Jan 2019
Grounded Video Description
Luowei Zhou
Yannis Kalantidis
Xinlei Chen
Jason J. Corso
Marcus Rohrbach
330
203
0
17 Dec 2018
Detecting unseen visual relations using analogies
Julia Peyre
Ivan Laptev
Cordelia Schmid
Josef Sivic
144
18
0
13 Dec 2018
PIRC Net : Using Proposal Indexing, Relationships and Context for Phrase Grounding
Rama Kovvuri
Ram Nevatia
ObjD
156
20
0
07 Dec 2018
Multi-task Learning of Hierarchical Vision-Language Representation
Duy-Kien Nguyen
Takayuki Okatani
259
56
0
03 Dec 2018
Multi-level Multimodal Common Semantic Space for Image-Phrase Grounding
Hassan Akbari
Svebor Karaman
Surabhi Bhargava
Brian Chen
Carl Vondrick
Shih-Fu Chang
152
86
0
28 Nov 2018
From Recognition to Cognition: Visual Commonsense Reasoning
Rowan Zellers
Yonatan Bisk
Ali Farhadi
Yejin Choi
LRM, BDL, OCL, ReLM
838
995
0
27 Nov 2018
Show, Control and Tell: A Framework for Generating Controllable and Grounded Captions
Marcella Cornia
Lorenzo Baraldi
Rita Cucchiara
DiffM
275
196
0
26 Nov 2018
Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation
Computer Vision and Pattern Recognition (CVPR), 2018
Xin Eric Wang
Qiuyuan Huang
Asli Celikyilmaz
Jianfeng Gao
Dinghan Shen
Yuan-fang Wang
William Yang Wang
Lei Zhang
LM&Ro, SSL
421
601
0
25 Nov 2018
SEIGAN: Towards Compositional Image Generation by Simultaneously Learning to Segment, Enhance, and Inpaint
Pavel Ostyakov
Roman Suvorov
Elizaveta Logacheva
Oleg Khomenko
Sergey I. Nikolenko
GAN
187
24
0
19 Nov 2018
Revisiting Image-Language Networks for Open-ended Phrase Detection
Bryan A. Plummer
Kevin J. Shih
Yichen Li
Ke Xu
Svetlana Lazebnik
Stan Sclaroff
Kate Saenko
ObjD, SSeg
144
4
0
17 Nov 2018
CUNI System for the WMT18 Multimodal Translation Task
Jindřich Helcl
Jindřich Libovický
Dušan Variš
213
61
0
12 Nov 2018
Reducing Network Agnostophobia
A. Dhamija
Manuel Günther
Terrance E. Boult
AAML, UQCV
390
348
0
09 Nov 2018
How2: A Large-scale Dataset for Multimodal Language Understanding
Ramon Sanabria
Ozan Caglayan
Shruti Palaskar
Desmond Elliott
Loïc Barrault
Lucia Specia
Florian Metze
VGen, MLLM
266
312
0
01 Nov 2018
Learning to Globally Edit Images with Textual Description
Hai Wang
Jason D. Williams
Sin-Han Kang
DiffM
146
18
0
13 Oct 2018
Image Captioning as Neural Machine Translation Task in SOCKEYE
Loris Bazzani
Tobias Domhan
Felix Hieber
VLM
167
2
0
09 Oct 2018
A Comprehensive Survey of Deep Learning for Image Captioning
Md Zakir Hossain
Ferdous Sohel
M. Shiratuddin
Hamid Laga
VLM, 3DV
309
850
0
06 Oct 2018
Visual Coreference Resolution in Visual Dialog using Neural Module Networks
Satwik Kottur
José M. F. Moura
Devi Parikh
Dhruv Batra
Marcus Rohrbach
216
170
0
06 Sep 2018
TVQA: Localized, Compositional Video Question Answering
Jie Lei
Licheng Yu
Mohit Bansal
Tamara L. Berg
443
724
0
05 Sep 2018
Learning to Describe Differences Between Pairs of Similar Images
Harsh Jhamtani
Taylor Berg-Kirkpatrick
219
194
0
31 Aug 2018
Webly Supervised Joint Embedding for Cross-Modal Image-Text Retrieval
Niluthpol Chowdhury Mithun
Rameswar Panda
Evangelos E. Papalexakis
Amit K. Roy-Chowdhury
216
78
0
23 Aug 2018
SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference
Rowan Zellers
Yonatan Bisk
Roy Schwartz
Yejin Choi
502
765
0
16 Aug 2018
Doubly Attentive Transformer Machine Translation
Hasan Sait Arslan
Mark Fishel
G. Anbarjafari
191
18
0
30 Jul 2018
A Pipeline for Creative Visual Storytelling
S. Lukin
Reginald L. Hobbs
Clare R. Voss
108
29
0
21 Jul 2018
Revisiting Cross Modal Retrieval
Shah Nawaz
Muhammad Kamran Janjua
Alessandro Calefati
I. Gallo
134
6
0
19 Jul 2018
Factorizable Net: An Efficient Subgraph-based Framework for Scene Graph Generation
European Conference on Computer Vision (ECCV), 2018
Yikang Li
Wanli Ouyang
Bolei Zhou
Jianping Shi
Chao Zhang
Xiaogang Wang
GNN
251
280
0
29 Jun 2018
iParaphrasing: Extracting Visually Grounded Paraphrases via an Image
Chenhui Chu
Mayu Otani
Yuta Nakashima
138
8
0
12 Jun 2018
Speaker-Follower Models for Vision-and-Language Navigation
Daniel Fried
Ronghang Hu
Volkan Cirik
Anna Rohrbach
Jacob Andreas
Louis-Philippe Morency
Taylor Berg-Kirkpatrick
Kate Saenko
Dan Klein
Trevor Darrell
LM&Ro, LRM
690
562
0
07 Jun 2018
Video Description: A Survey of Methods, Datasets and Evaluation Metrics
Nayyer Aafaq
Ajmal Mian
Wei Liu
Syed Zulqarnain Gilani
Mubarak Shah
486
101
0
01 Jun 2018
Bilinear Attention Networks
Jin-Hwa Kim
Jaehyun Jun
Byoung-Tak Zhang
AIMat
492
995
0
21 May 2018
Rethinking Diversified and Discriminative Proposal Generation for Visual Grounding
Zhou Yu
Jun-chen Yu
Chenchao Xiang
Zhou Zhao
Q. Tian
Dacheng Tao
ObjD
214
147
0
09 May 2018
Weakly-Supervised Video Object Grounding from Text by Loss Weighting and Object Interaction
Luowei Zhou
Nathan Louis
Jason J. Corso
301
101
0
08 May 2018
Hypothesis Only Baselines in Natural Language Inference
Adam Poliak
Jason Naradowsky
Aparajita Haldar
Rachel Rudinger
Benjamin Van Durme
528
609
0
02 May 2018
Dialog-based Interactive Image Retrieval
Xiaoxiao Guo
Hui Wu
Yu Cheng
Steven J. Rennie
Gerald Tesauro
Rogerio Feris
387
227
0
01 May 2018
Imagine This! Scripts to Compositions to Videos
Tanmay Gupta
Dustin Schwenk
Ali Farhadi
Derek Hoiem
Aniruddha Kembhavi
CoGe, VGen
262
97
0
10 Apr 2018
Learning a Text-Video Embedding from Incomplete and Heterogeneous Data
Antoine Miech
Ivan Laptev
Josef Sivic
339
244
0
07 Apr 2018
Interpretable and Globally Optimal Prediction for Textual Grounding using Image Concepts
Raymond A. Yeh
Jinjun Xiong
Wen-mei W. Hwu
Minh Do
Alex Schwing
130
58
0
29 Mar 2018
Two can play this Game: Visual Dialog with Discriminative Question Generation and Answering
Unnat Jain
Svetlana Lazebnik
Alex Schwing
MLLM
128
83
0
29 Mar 2018
Unsupervised Textual Grounding: Linking Words to Image Concepts
Raymond A. Yeh
Minh Do
Alex Schwing
123
44
0
29 Mar 2018
Neural Baby Talk
Jiasen Lu
Jianwei Yang
Dhruv Batra
Devi Parikh
VLM
398
458
0
27 Mar 2018
Video Object Segmentation with Language Referring Expressions
Anna Khoreva
Anna Rohrbach
Bernt Schiele
VOS
261
243
0
21 Mar 2018
Learning Unsupervised Visual Grounding Through Semantic Self-Supervision
Syed Ashar Javed
Shreyas Saxena
Vineet Gandhi
SSL
205
25
0
17 Mar 2018
Object Captioning and Retrieval with Natural Language
A. Nguyen
Thanh-Toan Do
Ian Reid
D. Caldwell
Nikos G. Tsagarakis
3DV
109
21
0
16 Mar 2018
Unpaired Image Captioning by Language Pivoting
European Conference on Computer Vision (ECCV), 2018
Jiuxiang Gu
Shafiq Joty
Jianfei Cai
G. Wang
253
89
0
14 Mar 2018
Knowledge Aided Consistency for Weakly Supervised Phrase Grounding
Kan Chen
J. Gao
Ram Nevatia
181
99
0
11 Mar 2018