Connecting Vision and Language with Localized Narratives
arXiv:1912.03098 (v4, latest)

European Conference on Computer Vision (ECCV), 2019
6 December 2019
Jordi Pont-Tuset
J. Uijlings
Soravit Changpinyo
Radu Soricut
V. Ferrari
    ObjD

Papers citing "Connecting Vision and Language with Localized Narratives"

50 / 200 papers shown
Pre-training image-language transformers for open-vocabulary tasks
A. Piergiovanni
Weicheng Kuo
A. Angelova
VLM, ViT
176
12
0
09 Sep 2022
Multimodal Lecture Presentations Dataset: Understanding Multimodality in Educational Slides
Dong Won Lee
Chaitanya Ahuja
Paul Pu Liang
Sanika Natu
Louis-Philippe Morency
278
10
0
17 Aug 2022
Layout-Bridging Text-to-Image Synthesis
Jiadong Liang
Wenjie Pei
Feng Lu
EGVM
163
20
0
12 Aug 2022
PPMN: Pixel-Phrase Matching Network for One-Stage Panoptic Narrative Grounding
ACM Multimedia (ACM MM), 2022
Zihan Ding
Zixiang Ding
Tianrui Hui
Junshi Huang
Xiaoming Wei
Xiaolin K. Wei
Si Liu
198
15
0
11 Aug 2022
A Sketch Is Worth a Thousand Words: Image Retrieval with Text and Sketch
European Conference on Computer Vision (ECCV), 2022
Patsorn Sangkloy
Wittawat Jitkrittum
Diyi Yang
James Hays
3DV
181
42
0
05 Aug 2022
Cross-Modal Alignment Learning of Vision-Language Conceptual Systems
Taehyeong Kim
H. Song
Byoung-Tak Zhang
202
5
0
31 Jul 2022
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
Jiahui Yu
Yuanzhong Xu
Jing Yu Koh
Thang Luong
Gunjan Baid
...
Zarana Parekh
Xin Li
Han Zhang
Jason Baldridge
Yonghui Wu
EGVM
644
1,359
0
22 Jun 2022
Crossmodal-3600: A Massively Multilingual Multimodal Evaluation Dataset
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Ashish V. Thapliyal
Jordi Pont-Tuset
Xi Chen
Radu Soricut
VGen
575
106
0
25 May 2022
Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering
A. Piergiovanni
Wei Li
Weicheng Kuo
M. Saffar
Fred Bertsch
A. Angelova
279
18
0
02 May 2022
Improving Multimodal Speech Recognition by Data Augmentation and Speech Representations
Dan Oneaţă
H. Cucu
118
24
0
27 Apr 2022
SceneTrilogy: On Human Scene-Sketch and its Complementarity with Photo and Text
Computer Vision and Pattern Recognition (CVPR), 2022
Pinaki Nath Chowdhury
A. Bhunia
Aneeshan Sain
Subhadeep Koley
Tao Xiang
Yi-Zhe Song
397
37
0
25 Apr 2022
It is Okay to Not Be Okay: Overcoming Emotional Bias in Affective Image Captioning by Contrastive Data Collection
Computer Vision and Pattern Recognition (CVPR), 2022
Youssef Mohamed
Faizan Farooq Khan
Kilichbek Haydarov
Mohamed Elhoseiny
127
44
0
15 Apr 2022
X-DETR: A Versatile Architecture for Instance-wise Vision-Language Tasks
European Conference on Computer Vision (ECCV), 2022
Zhaowei Cai
Gukyeong Kwon
Avinash Ravichandran
Erhan Bas
Zhuowen Tu
Rahul Bhotika
Stefano Soatto
ObjD, MLLM, VLM
141
51
0
12 Apr 2022
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality
Computer Vision and Pattern Recognition (CVPR), 2022
Tristan Thrush
Ryan Jiang
Max Bartolo
Amanpreet Singh
Adina Williams
Douwe Kiela
Candace Ross
CoGe
374
521
0
07 Apr 2022
KNN-Diffusion: Image Generation via Large-Scale Retrieval
International Conference on Learning Representations (ICLR), 2022
Shelly Sheynin
Oron Ashual
Adam Polyak
Uriel Singer
Oran Gafni
Eliya Nachmani
Yaniv Taigman
VLM, SyDa, DiffM
238
147
0
06 Apr 2022
DT2I: Dense Text-to-Image Generation from Region Descriptions
International Conference on Artificial Neural Networks (ICANN), 2022
Stanislav Frolov
Prateek Bansal
Jörn Hees
Andreas Dengel
VLM
159
5
0
05 Apr 2022
Keyword localisation in untranscribed speech using visually grounded speech models
IEEE Journal on Selected Topics in Signal Processing (IEEE JSTSP), 2022
Kayode Olaleye
Dan Oneaţă
Herman Kamper
193
7
0
02 Feb 2022
Deep Learning Approaches on Image Captioning: A Review
ACM Computing Surveys (ACM CSUR), 2022
Taraneh Ghandi
H. Pourreza
H. Mahyar
VLM
480
150
0
31 Jan 2022
Scaling Open-Vocabulary Image Segmentation with Image-Level Labels
European Conference on Computer Vision (ECCV), 2021
Golnaz Ghiasi
Xiuye Gu
Huayu Chen
Nayeon Lee
VLM
444
494
0
22 Dec 2021
MAGMA -- Multimodal Augmentation of Generative Models through Adapter-based Finetuning
C. Eichenberg
Sid Black
Samuel Weinbach
Letitia Parcalabescu
Anette Frank
MLLM, VLM
259
109
0
09 Dec 2021
FLAVA: A Foundational Language And Vision Alignment Model
Amanpreet Singh
Ronghang Hu
Vedanuj Goswami
Guillaume Couairon
Wojciech Galuba
Marcus Rohrbach
Douwe Kiela
CLIP, VLM
355
863
0
08 Dec 2021
Object-Centric Unsupervised Image Captioning
Zihang Meng
David Yang
Xuefei Cao
Ashish Shah
Ser-Nam Lim
OCL, VLM
194
14
0
02 Dec 2021
LAFITE: Towards Language-Free Training for Text-to-Image Generation
Computer Vision and Pattern Recognition (CVPR), 2021
Yufan Zhou
Ruiyi Zhang
Changyou Chen
Chunyuan Li
Chris Tensmeyer
Tong Yu
Jiuxiang Gu
Jinhui Xu
Tong Sun
VLM
293
204
0
27 Nov 2021
Less is More: Generating Grounded Navigation Instructions from Landmarks
Su Wang
Ceslee Montgomery
Jordi Orbay
Vighnesh Birodkar
Aleksandra Faust
Izzeddin Gur
Natasha Jaques
Austin Waters
Jason Baldridge
Peter Anderson
433
81
0
25 Nov 2021
Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021
A. Maharana
Mohit Bansal
252
70
0
21 Oct 2021
Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset
Ian Palmer
Andrew Rouditchenko
Andrei Barbu
Boris Katz
James R. Glass
130
4
0
14 Oct 2021
What Vision-Language Models 'See' when they See Scenes
Michele Cafagna
Kees van Deemter
Albert Gatt
VLM
259
13
0
15 Sep 2021
Panoptic Narrative Grounding
IEEE International Conference on Computer Vision (ICCV), 2021
Cristina González
Nicolás Ayobi
Isabela Hernández
José Hernández
Jordi Pont-Tuset
Pablo Arbeláez
248
27
0
10 Sep 2021
LocTex: Learning Data-Efficient Visual Representations from Localized Textual Supervision
IEEE International Conference on Computer Vision (ICCV), 2021
Zhijian Liu
Simon Stent
Jie Li
John Gideon
Song Han
VLM
188
10
0
26 Aug 2021
From Show to Tell: A Survey on Deep Learning-based Image Captioning
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2021
Matteo Stefanini
Marcella Cornia
Lorenzo Baraldi
S. Cascianelli
G. Fiameni
Rita Cucchiara
3DV, VLM, MLLM
435
344
0
14 Jul 2021
OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation
Jing Liu
Xinxin Zhu
Fei Liu
Longteng Guo
Zijia Zhao
...
Weining Wang
Hanqing Lu
Shiyu Zhou
Jiajun Zhang
Jinqiao Wang
292
41
0
01 Jul 2021
A Picture May Be Worth a Hundred Words for Visual Question Answering
Yusuke Hirota
Noa Garcia
Mayu Otani
Chenhui Chu
Yuta Nakashima
Ittetsu Taniguchi
Takao Onoye
ViT
145
4
0
25 Jun 2021
Bridging the Gap Between Object Detection and User Intent via Query-Modulation
Marco Fornoni
Chaochao Yan
Liangchen Luo
Kimberly Wilber
A. Stark
Huayu Chen
Boqing Gong
Andrew G. Howard
ObjD
128
1
0
18 Jun 2021
Connecting What to Say With Where to Look by Modeling Human Attention Traces
Computer Vision and Pattern Recognition (CVPR), 2021
Zihang Meng
Licheng Yu
Ning Zhang
Tamara L. Berg
Babak Damavandi
Vikas Singh
Amy Bearman
261
31
0
12 May 2021
Concadia: Towards Image-Based Text Generation with a Purpose
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021
Elisa Kreiss
Fei Fang
Noah D. Goodman
Christopher Potts
227
25
0
16 Apr 2021
Talk, Don't Write: A Study of Direct Speech-Based Image Retrieval
Interspeech (Interspeech), 2021
Ramon Sanabria
Austin Waters
Jason Baldridge
3DV
191
27
0
05 Apr 2021
PanGEA: The Panoramic Graph Environment Annotation Toolkit
Alexander Ku
Peter Anderson
Jordi Pont-Tuset
Jason Baldridge
164
2
0
23 Mar 2021
Human-like Controllable Image Captioning with Verb-specific Semantic Roles
Computer Vision and Pattern Recognition (CVPR), 2021
Long Chen
Zhihong Jiang
Jun Xiao
Wei Liu
252
82
0
22 Mar 2021
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
Computer Vision and Pattern Recognition (CVPR), 2021
Soravit Changpinyo
P. Sharma
Nan Ding
Radu Soricut
VLM
1.1K
1,360
0
17 Feb 2021
Telling the What while Pointing to the Where: Multimodal Queries for Image Retrieval
IEEE International Conference on Computer Vision (ICCV), 2021
Soravit Changpinyo
Jordi Pont-Tuset
V. Ferrari
Radu Soricut
197
28
0
09 Feb 2021
Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers
Transactions of the Association for Computational Linguistics (TACL), 2021
Lisa Anne Hendricks
John F. J. Mellor
R. Schneider
Jean-Baptiste Alayrac
Aida Nematzadeh
234
126
0
31 Jan 2021
Adversarial Text-to-Image Synthesis: A Review
Neural Networks (NN), 2021
Stanislav Frolov
Tobias Hinz
Federico Raue
Jörn Hees
Andreas Dengel
EGVM
321
201
0
25 Jan 2021
ArtEmis: Affective Language for Visual Art
Computer Vision and Pattern Recognition (CVPR), 2021
Panos Achlioptas
M. Ovsjanikov
Kilichbek Haydarov
Mohamed Elhoseiny
Leonidas Guibas
133
152
0
19 Jan 2021
Cross-Modal Contrastive Learning for Text-to-Image Generation
Computer Vision and Pattern Recognition (CVPR), 2021
Han Zhang
Jing Yu Koh
Jason Baldridge
Honglak Lee
Yinfei Yang
GAN
512
417
0
12 Jan 2021
StacMR: Scene-Text Aware Cross-Modal Retrieval
Andrés Mafla
Rafael Sampaio de Rezende
Lluís Gómez
Diane Larlus
Dimosthenis Karatzas
3DV
194
19
0
08 Dec 2020
Understanding Guided Image Captioning Performance across Domains
Conference on Computational Natural Language Learning (CoNLL), 2020
Edwin G. Ng
Bo Pang
P. Sharma
Radu Soricut
369
28
0
04 Dec 2020
Text-to-Image Generation Grounded by Fine-Grained User Attention
Jing Yu Koh
Jason Baldridge
Honglak Lee
Yinfei Yang
DiffM
260
64
0
07 Nov 2020
Room-Across-Room: Multilingual Vision-and-Language Navigation with Dense Spatiotemporal Grounding
Alexander Ku
Peter Anderson
Roma Patel
Eugene Ie
Jason Baldridge
233
416
0
15 Oct 2020
Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision
Hao Tan
Mohit Bansal
CLIP
200
129
0
14 Oct 2020
Fine-Grained Grounding for Multimodal Speech Recognition
Findings (Findings), 2020
Tejas Srinivasan
Ramon Sanabria
Florian Metze
Desmond Elliott
161
11
0
05 Oct 2020