Communities
Connect sessions
AI calendar
Organizations
Join Slack
Contact Sales
Search
Open menu
Home
Papers
1912.03098
Cited By
v1
v2
v3
v4 (latest)
Connecting Vision and Language with Localized Narratives
European Conference on Computer Vision (ECCV), 2019
6 December 2019
Jordi Pont-Tuset
J. Uijlings
Soravit Changpinyo
Radu Soricut
V. Ferrari
ObjD
Re-assign community
ArXiv (abs)
PDF
HTML
Papers citing
"Connecting Vision and Language with Localized Narratives"
50 / 198 papers shown
Title
HEMM: Holistic Evaluation of Multimodal Foundation Models
Paul Pu Liang
Akshay Goindani
Talha Chafekar
Leena Mathur
Haofei Yu
Ruslan Salakhutdinov
Louis-Philippe Morency
270
24
0
03 Jul 2024
A look under the hood of the Interactive Deep Learning Enterprise (No-IDLE)
Daniel Sonntag
Michael Barz
Thiago S. Gouvêa
VLM
220
6
0
27 Jun 2024
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
Shengbang Tong
Ellis L Brown
Penghao Wu
Sanghyun Woo
Manoj Middepogu
...
Xichen Pan
Austin Wang
Rob Fergus
Yann LeCun
Saining Xie
3DV
MLLM
311
603
0
24 Jun 2024
LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning
Dantong Niu
Yuvan Sharma
Giscard Biamby
Jerome Quenum
Yutong Bai
Baifeng Shi
Trevor Darrell
Roei Herzig
LM&Ro
VLM
189
55
0
17 Jun 2024
From Pixels to Prose: A Large Dataset of Dense Image Captions
Vasu Singla
Kaiyu Yue
Sukriti Paul
Reza Shirkavand
Mayuka Jayawardhana
Alireza Ganjdanesh
Heng Huang
A. Bhatele
Gowthami Somepalli
Tom Goldstein
3DV
VLM
246
38
0
14 Jun 2024
Open-Vocabulary Semantic Segmentation with Image Embedding Balancing
Computer Vision and Pattern Recognition (CVPR), 2024
Xiangheng Shan
Dongyue Wu
Guilin Zhu
Yuanjie Shao
Nong Sang
Changxin Gao
VLM
159
25
0
14 Jun 2024
Comparison Visual Instruction Tuning
Wei Lin
M. Jehanzeb Mirza
Sivan Doveh
Rogerio Feris
Raja Giryes
Sepp Hochreiter
Leonid Karlinsky
226
5
0
13 Jun 2024
Evaluating Vision-Language Models on Bistable Images
Artemis Panagopoulou
Coby Melkin
Chris Callison-Burch
149
2
0
29 May 2024
The Evolution of Multimodal Model Architectures
S. Wadekar
Abhishek Chaurasia
Vasu Sharma
Eugenio Culurciello
265
26
0
28 May 2024
DEEM: Diffusion Models Serve as the Eyes of Large Language Models for Image Perception
Run Luo
Yunshui Li
Longze Chen
Wanwei He
Ting-En Lin
...
Zikai Song
Xiaobo Xia
Tongliang Liu
Min Yang
Binyuan Hui
VLM
DiffM
330
33
0
24 May 2024
G-VOILA: Gaze-Facilitated Information Querying in Daily Scenarios
Proceedings of the ACM on Interactive Mobile Wearable and Ubiquitous Technologies (IMWUT), 2024
Zeyu Wang
Yuanchun Shi
Yuntao wang
Yuchen Yao
Kun Yan
Yuhan Wang
Lei Ji
Xuhai Xu
Chun Yu
151
20
0
13 May 2024
What matters when building vision-language models?
Neural Information Processing Systems (NeurIPS), 2024
Hugo Laurençon
Léo Tronchon
Matthieu Cord
Victor Sanh
VLM
248
270
0
03 May 2024
DOCCI: Descriptions of Connected and Contrasting Images
Yasumasa Onoe
Sunayana Rane
Zachary Berger
Yonatan Bitton
Jaemin Cho
...
Zarana Parekh
Jordi Pont-Tuset
Garrett Tanzer
Su Wang
Jason Baldridge
234
93
0
30 Apr 2024
Revisiting Text-to-Image Evaluation with Gecko: On Metrics, Prompts, and Human Ratings
Olivia Wiles
Chuhan Zhang
Isabela Albuquerque
Ivana Kajić
Su Wang
...
Jordi Pont-Tuset
Aida Nematzadeh
Anant Nawalgaria
Jordi Pont-Tuset
Aida Nematzadeh
EGVM
871
30
0
25 Apr 2024
[Call for Papers] The 2nd BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus
Leshem Choshen
Robert Bamler
Michael Y. Hu
Tal Linzen
Aaron Mueller
Candace Ross
Alex Warstadt
Ethan Gotlieb Wilcox
Adina Williams
Chengxu Zhuang
269
36
0
09 Apr 2024
Toward Interactive Regional Understanding in Vision-Large Language Models
Jungbeom Lee
Sanghyuk Chun
Sangdoo Yun
VLM
247
4
0
27 Mar 2024
SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data
Neural Information Processing Systems (NeurIPS), 2024
Jialu Li
Jaemin Cho
Yi-Lin Sung
Jaehong Yoon
Mohit Bansal
MoMe
DiffM
193
15
0
11 Mar 2024
GROUNDHOG: Grounding Large Language Models to Holistic Segmentation
Yichi Zhang
Ziqiao Ma
Xiaofeng Gao
Suhaila Shakiah
Qiaozi Gao
Joyce Chai
MLLM
VLM
315
74
0
26 Feb 2024
Generalizable Semantic Vision Query Generation for Zero-shot Panoptic and Semantic Segmentation
Jialei Chen
Daisuke Deguchi
Chenkai Zhang
Hiroshi Murase
VLM
212
1
0
21 Feb 2024
Text-to-Image Cross-Modal Generation: A Systematic Review
Maciej Żelaszczyk
Jacek Mańdziuk
289
6
0
21 Jan 2024
MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer
Changyao Tian
Xizhou Zhu
Yuwen Xiong
Weiyun Wang
Zhe Chen
...
Tong Lu
Jie Zhou
Jiaming Song
Yu Qiao
Jifeng Dai
AuLLM
195
68
0
18 Jan 2024
Let's Go Shopping (LGS) -- Web-Scale Image-Text Dataset for Visual Concept Understanding
Yatong Bai
Utsav Garg
Apaar Shanker
Haoming Zhang
Samyak Parajuli
...
Eugenia D Fomitcheva
E. Branson
Aerin Kim
Somayeh Sojoudi
Kyunghyun Cho
135
2
0
09 Jan 2024
CaMML: Context-Aware Multimodal Learner for Large Models
Yixin Chen
Shuai Zhang
Boran Han
Tong He
Bo Li
VLM
192
6
0
06 Jan 2024
Voila-A: Aligning Vision-Language Models with User's Gaze Attention
Kun Yan
Lei Ji
Zeyu Wang
Yuntao Wang
Nan Duan
Shuai Ma
209
20
0
22 Dec 2023
Pixel Aligned Language Models
Computer Vision and Pattern Recognition (CVPR), 2023
Jiarui Xu
Xingyi Zhou
Shen Yan
Xiuye Gu
Anurag Arnab
Chen Sun
Xiaolong Wang
Cordelia Schmid
MLLM
VLM
227
17
0
14 Dec 2023
A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions
Computer Vision and Pattern Recognition (CVPR), 2023
Jack Urbanek
Florian Bordes
Pietro Astolfi
Mary Williamson
Vasu Sharma
Adriana Romero Soriano
CLIP
3DV
284
82
0
14 Dec 2023
Localized Symbolic Knowledge Distillation for Visual Commonsense Models
Neural Information Processing Systems (NeurIPS), 2023
Jinho Park
Jack Hessel
Khyathi Chandu
Paul Pu Liang
Ximing Lu
...
Youngjae Yu
Qiuyuan Huang
Jianfeng Gao
Ali Farhadi
Yejin Choi
VLM
200
13
0
08 Dec 2023
Quilt-LLaVA: Visual Instruction Tuning by Extracting Localized Narratives from Open-Source Histopathology Videos
Computer Vision and Pattern Recognition (CVPR), 2023
M. S. Seyfioglu
Wisdom O. Ikezogwo
Fatemeh Ghezloo
Ranjay Krishna
Linda G. Shapiro
397
77
0
07 Dec 2023
Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment
European Conference on Computer Vision (ECCV), 2023
Brian Gordon
Yonatan Bitton
Yonatan Shafir
Roopal Garg
Xi Chen
Dani Lischinski
Daniel Cohen-Or
Idan Szpektor
186
17
0
05 Dec 2023
DreamSync: Aligning Text-to-Image Generation with Image Understanding Feedback
Jiao Sun
Deqing Fu
Yushi Hu
Su Wang
Royi Rassin
...
Dana Alon
Charles Herrmann
Sjoerd van Steenkiste
Ranjay Krishna
Cyrus Rashtchian
EGVM
160
54
0
29 Nov 2023
Large Language Models Meet Computer Vision: A Brief Survey
Raby Hamadi
LM&MA
123
5
0
28 Nov 2023
SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation
Computer Vision and Pattern Recognition (CVPR), 2023
Bin Xie
Jiale Cao
Jin Xie
Fahad Shahbaz Khan
Yanwei Pang
VLM
193
86
0
27 Nov 2023
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
Computer Vision and Pattern Recognition (CVPR), 2023
Bin Xiao
Haiping Wu
Weijian Xu
Xiyang Dai
Houdong Hu
Yumao Lu
Michael Zeng
Ce Liu
Lu Yuan
VLM
297
353
0
10 Nov 2023
Enriching Phrases with Coupled Pixel and Object Contexts for Panoptic Narrative Grounding
International Joint Conference on Artificial Intelligence (IJCAI), 2023
Tianrui Hui
Zihan Ding
Junshi Huang
Xiaoming Wei
Xiaolin K. Wei
Jiao Dai
Jizhong Han
Si Liu
260
7
0
02 Nov 2023
Davidsonian Scene Graph: Improving Reliability in Fine-grained Evaluation for Text-to-Image Generation
International Conference on Learning Representations (ICLR), 2023
Jaemin Cho
Yushi Hu
Roopal Garg
Peter Anderson
Ranjay Krishna
Jason Baldridge
Mohit Bansal
Jordi Pont-Tuset
Su Wang
EGVM
273
118
0
27 Oct 2023
Context Does Matter: End-to-end Panoptic Narrative Grounding with Deformable Attention Refined Matching Network
Industrial Conference on Data Mining (IDM), 2023
Yiming Lin
Xiao-Bo Jin
Qiufeng Wang
Kaizhu Huang
137
5
0
25 Oct 2023
Semi-supervised multimodal coreference resolution in image narrations
A. Goel
Basura Fernando
Frank Keller
Hakan Bilen
155
6
0
20 Oct 2023
NICE: Improving Panoptic Narrative Detection and Segmentation with Cascading Collaborative Learning
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Haowei Wang
Jiayi Ji
Tianyu Guo
Yilong Yang
Weihao Ye
Xiaoshuai Sun
Rongrong Ji
272
8
0
17 Oct 2023
Reading Books is Great, But Not if You Are Driving! Visually Grounded Reasoning about Defeasible Commonsense Norms
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Seungju Han
Junhyeok Kim
Jack Hessel
Liwei Jiang
Jiwan Chung
Yejin Son
Yejin Choi
Youngjae Yu
136
5
0
16 Oct 2023
ALT-Pilot: Autonomous navigation with Language augmented Topometric maps
Mohammad Omama
Pranav Inani
Pranjal Paul
Sarat Chandra Yellapragada
Krishna Murthy Jatavallabhula
Sandeep Chinchali
Madhava Krishna
125
20
0
03 Oct 2023
CLIP Is Also a Good Teacher: A New Learning Framework for Inductive Zero-shot Semantic Segmentation
Jialei Chen
Daisuke Deguchi
Chenkai Zhang
Xu Zheng
Hiroshi Murase
VLM
235
9
0
03 Oct 2023
DreamLLM: Synergistic Multimodal Comprehension and Creation
International Conference on Learning Representations (ICLR), 2023
Runpei Dong
Chunrui Han
Yuang Peng
Zekun Qi
Zheng Ge
...
Hao-Ran Wei
Xiangwen Kong
Xiangyu Zhang
Kaisheng Ma
Li Yi
MLLM
230
271
0
20 Sep 2023
Collecting Visually-Grounded Dialogue with A Game Of Sorts
International Conference on Language Resources and Evaluation (LREC), 2023
Bram Willemsen
Dmytro Kalpakchi
Gabriel Skantze
81
2
0
10 Sep 2023
A Joint Study of Phrase Grounding and Task Performance in Vision and Language Models
Noriyuki Kojima
Hadar Averbuch-Elor
Yoav Artzi
235
2
0
06 Sep 2023
Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning
L. Yu
Bowen Shi
Ramakanth Pasunuru
Benjamin Muller
O. Yu. Golovneva
...
Yaniv Taigman
Maryam Fazel-Zarandi
Asli Celikyilmaz
Luke Zettlemoyer
Armen Aghajanyan
MLLM
191
160
0
05 Sep 2023
SCoRD: Subject-Conditional Relation Detection with Text-Augmented Data
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2023
Ziyan Yang
Kushal Kafle
Zhe Lin
Scott D. Cohen
Zhihong Ding
Vicente Ordonez
200
1
0
24 Aug 2023
Uni-NLX: Unifying Textual Explanations for Vision and Vision-Language Tasks
Fawaz Sammani
Nikos Deligiannis
131
6
0
17 Aug 2023
ALIP: Adaptive Language-Image Pre-training with Synthetic Caption
IEEE International Conference on Computer Vision (ICCV), 2023
Kaicheng Yang
Jiankang Deng
Xiang An
Jiawei Li
Ziyong Feng
Jia Guo
Jing Yang
Tongliang Liu
VLM
CLIP
178
80
0
16 Aug 2023
VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use
Yonatan Bitton
Hritik Bansal
Jack Hessel
Rulin Shao
Wanrong Zhu
Anas Awadalla
Josh Gardner
Rohan Taori
L. Schimdt
VLM
371
97
0
12 Aug 2023
Taming Self-Training for Open-Vocabulary Object Detection
Computer Vision and Pattern Recognition (CVPR), 2023
Shiyu Zhao
S. Schulter
Long Zhao
Zhixing Zhang
Vijay Kumar B.G
Yumin Suh
Manmohan Chandraker
Dimitris N. Metaxas
VLM
ObjD
302
19
0
11 Aug 2023
Previous
1
2
3
4
Next