ResearchTrend.AI
  • Communities
  • Connect sessions
  • AI calendar
  • Organizations
  • Join Slack
  • Contact Sales
Papers
Communities
Social Events
Terms and Conditions
Pricing
Contact Sales
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 1912.03098
  4. Cited By
Connecting Vision and Language with Localized Narratives
v1v2v3v4 (latest)

Connecting Vision and Language with Localized Narratives

European Conference on Computer Vision (ECCV), 2019
6 December 2019
Jordi Pont-Tuset
J. Uijlings
Soravit Changpinyo
Radu Soricut
V. Ferrari
    ObjD
ArXiv (abs)PDFHTML

Papers citing "Connecting Vision and Language with Localized Narratives"

50 / 198 papers shown
Title
HEMM: Holistic Evaluation of Multimodal Foundation Models
HEMM: Holistic Evaluation of Multimodal Foundation Models
Paul Pu Liang
Akshay Goindani
Talha Chafekar
Leena Mathur
Haofei Yu
Ruslan Salakhutdinov
Louis-Philippe Morency
270
24
0
03 Jul 2024
A look under the hood of the Interactive Deep Learning Enterprise
  (No-IDLE)
A look under the hood of the Interactive Deep Learning Enterprise (No-IDLE)
Daniel Sonntag
Michael Barz
Thiago S. Gouvêa
VLM
220
6
0
27 Jun 2024
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
Shengbang Tong
Ellis L Brown
Penghao Wu
Sanghyun Woo
Manoj Middepogu
...
Xichen Pan
Austin Wang
Rob Fergus
Yann LeCun
Saining Xie
3DVMLLM
311
603
0
24 Jun 2024
LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning
LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning
Dantong Niu
Yuvan Sharma
Giscard Biamby
Jerome Quenum
Yutong Bai
Baifeng Shi
Trevor Darrell
Roei Herzig
LM&RoVLM
189
55
0
17 Jun 2024
From Pixels to Prose: A Large Dataset of Dense Image Captions
From Pixels to Prose: A Large Dataset of Dense Image Captions
Vasu Singla
Kaiyu Yue
Sukriti Paul
Reza Shirkavand
Mayuka Jayawardhana
Alireza Ganjdanesh
Heng Huang
A. Bhatele
Gowthami Somepalli
Tom Goldstein
3DVVLM
246
38
0
14 Jun 2024
Open-Vocabulary Semantic Segmentation with Image Embedding Balancing
Open-Vocabulary Semantic Segmentation with Image Embedding BalancingComputer Vision and Pattern Recognition (CVPR), 2024
Xiangheng Shan
Dongyue Wu
Guilin Zhu
Yuanjie Shao
Nong Sang
Changxin Gao
VLM
159
25
0
14 Jun 2024
Comparison Visual Instruction Tuning
Comparison Visual Instruction Tuning
Wei Lin
M. Jehanzeb Mirza
Sivan Doveh
Rogerio Feris
Raja Giryes
Sepp Hochreiter
Leonid Karlinsky
226
5
0
13 Jun 2024
Evaluating Vision-Language Models on Bistable Images
Evaluating Vision-Language Models on Bistable Images
Artemis Panagopoulou
Coby Melkin
Chris Callison-Burch
149
2
0
29 May 2024
The Evolution of Multimodal Model Architectures
The Evolution of Multimodal Model Architectures
S. Wadekar
Abhishek Chaurasia
Vasu Sharma
Eugenio Culurciello
265
26
0
28 May 2024
DEEM: Diffusion Models Serve as the Eyes of Large Language Models for Image Perception
DEEM: Diffusion Models Serve as the Eyes of Large Language Models for Image Perception
Run Luo
Yunshui Li
Longze Chen
Wanwei He
Ting-En Lin
...
Zikai Song
Xiaobo Xia
Tongliang Liu
Min Yang
Binyuan Hui
VLMDiffM
330
33
0
24 May 2024
G-VOILA: Gaze-Facilitated Information Querying in Daily Scenarios
G-VOILA: Gaze-Facilitated Information Querying in Daily ScenariosProceedings of the ACM on Interactive Mobile Wearable and Ubiquitous Technologies (IMWUT), 2024
Zeyu Wang
Yuanchun Shi
Yuntao wang
Yuchen Yao
Kun Yan
Yuhan Wang
Lei Ji
Xuhai Xu
Chun Yu
151
20
0
13 May 2024
What matters when building vision-language models?
What matters when building vision-language models?Neural Information Processing Systems (NeurIPS), 2024
Hugo Laurençon
Léo Tronchon
Matthieu Cord
Victor Sanh
VLM
248
270
0
03 May 2024
DOCCI: Descriptions of Connected and Contrasting Images
DOCCI: Descriptions of Connected and Contrasting Images
Yasumasa Onoe
Sunayana Rane
Zachary Berger
Yonatan Bitton
Jaemin Cho
...
Zarana Parekh
Jordi Pont-Tuset
Garrett Tanzer
Su Wang
Jason Baldridge
234
93
0
30 Apr 2024
Revisiting Text-to-Image Evaluation with Gecko: On Metrics, Prompts, and Human Ratings
Revisiting Text-to-Image Evaluation with Gecko: On Metrics, Prompts, and Human Ratings
Olivia Wiles
Chuhan Zhang
Isabela Albuquerque
Ivana Kajić
Su Wang
...
Jordi Pont-Tuset
Aida Nematzadeh
Anant Nawalgaria
Jordi Pont-Tuset
Aida Nematzadeh
EGVM
871
30
0
25 Apr 2024
[Call for Papers] The 2nd BabyLM Challenge: Sample-efficient pretraining
  on a developmentally plausible corpus
[Call for Papers] The 2nd BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus
Leshem Choshen
Robert Bamler
Michael Y. Hu
Tal Linzen
Aaron Mueller
Candace Ross
Alex Warstadt
Ethan Gotlieb Wilcox
Adina Williams
Chengxu Zhuang
269
36
0
09 Apr 2024
Toward Interactive Regional Understanding in Vision-Large Language
  Models
Toward Interactive Regional Understanding in Vision-Large Language Models
Jungbeom Lee
Sanghyuk Chun
Sangdoo Yun
VLM
247
4
0
27 Mar 2024
SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with
  Auto-Generated Data
SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated DataNeural Information Processing Systems (NeurIPS), 2024
Jialu Li
Jaemin Cho
Yi-Lin Sung
Jaehong Yoon
Mohit Bansal
MoMeDiffM
193
15
0
11 Mar 2024
GROUNDHOG: Grounding Large Language Models to Holistic Segmentation
GROUNDHOG: Grounding Large Language Models to Holistic Segmentation
Yichi Zhang
Ziqiao Ma
Xiaofeng Gao
Suhaila Shakiah
Qiaozi Gao
Joyce Chai
MLLMVLM
315
74
0
26 Feb 2024
Generalizable Semantic Vision Query Generation for Zero-shot Panoptic
  and Semantic Segmentation
Generalizable Semantic Vision Query Generation for Zero-shot Panoptic and Semantic Segmentation
Jialei Chen
Daisuke Deguchi
Chenkai Zhang
Hiroshi Murase
VLM
212
1
0
21 Feb 2024
Text-to-Image Cross-Modal Generation: A Systematic Review
Text-to-Image Cross-Modal Generation: A Systematic Review
Maciej Żelaszczyk
Jacek Mańdziuk
289
6
0
21 Jan 2024
MM-Interleaved: Interleaved Image-Text Generative Modeling via
  Multi-modal Feature Synchronizer
MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer
Changyao Tian
Xizhou Zhu
Yuwen Xiong
Weiyun Wang
Zhe Chen
...
Tong Lu
Jie Zhou
Jiaming Song
Yu Qiao
Jifeng Dai
AuLLM
195
68
0
18 Jan 2024
Let's Go Shopping (LGS) -- Web-Scale Image-Text Dataset for Visual
  Concept Understanding
Let's Go Shopping (LGS) -- Web-Scale Image-Text Dataset for Visual Concept Understanding
Yatong Bai
Utsav Garg
Apaar Shanker
Haoming Zhang
Samyak Parajuli
...
Eugenia D Fomitcheva
E. Branson
Aerin Kim
Somayeh Sojoudi
Kyunghyun Cho
135
2
0
09 Jan 2024
CaMML: Context-Aware Multimodal Learner for Large Models
CaMML: Context-Aware Multimodal Learner for Large Models
Yixin Chen
Shuai Zhang
Boran Han
Tong He
Bo Li
VLM
192
6
0
06 Jan 2024
Voila-A: Aligning Vision-Language Models with User's Gaze Attention
Voila-A: Aligning Vision-Language Models with User's Gaze Attention
Kun Yan
Lei Ji
Zeyu Wang
Yuntao Wang
Nan Duan
Shuai Ma
209
20
0
22 Dec 2023
Pixel Aligned Language Models
Pixel Aligned Language ModelsComputer Vision and Pattern Recognition (CVPR), 2023
Jiarui Xu
Xingyi Zhou
Shen Yan
Xiuye Gu
Anurag Arnab
Chen Sun
Xiaolong Wang
Cordelia Schmid
MLLMVLM
227
17
0
14 Dec 2023
A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style
  Models on Dense Captions
A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense CaptionsComputer Vision and Pattern Recognition (CVPR), 2023
Jack Urbanek
Florian Bordes
Pietro Astolfi
Mary Williamson
Vasu Sharma
Adriana Romero Soriano
CLIP3DV
284
82
0
14 Dec 2023
Localized Symbolic Knowledge Distillation for Visual Commonsense Models
Localized Symbolic Knowledge Distillation for Visual Commonsense ModelsNeural Information Processing Systems (NeurIPS), 2023
Jinho Park
Jack Hessel
Khyathi Chandu
Paul Pu Liang
Ximing Lu
...
Youngjae Yu
Qiuyuan Huang
Jianfeng Gao
Ali Farhadi
Yejin Choi
VLM
200
13
0
08 Dec 2023
Quilt-LLaVA: Visual Instruction Tuning by Extracting Localized Narratives from Open-Source Histopathology Videos
Quilt-LLaVA: Visual Instruction Tuning by Extracting Localized Narratives from Open-Source Histopathology VideosComputer Vision and Pattern Recognition (CVPR), 2023
M. S. Seyfioglu
Wisdom O. Ikezogwo
Fatemeh Ghezloo
Ranjay Krishna
Linda G. Shapiro
397
77
0
07 Dec 2023
Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment
Mismatch Quest: Visual and Textual Feedback for Image-Text MisalignmentEuropean Conference on Computer Vision (ECCV), 2023
Brian Gordon
Yonatan Bitton
Yonatan Shafir
Roopal Garg
Xi Chen
Dani Lischinski
Daniel Cohen-Or
Idan Szpektor
186
17
0
05 Dec 2023
DreamSync: Aligning Text-to-Image Generation with Image Understanding
  Feedback
DreamSync: Aligning Text-to-Image Generation with Image Understanding Feedback
Jiao Sun
Deqing Fu
Yushi Hu
Su Wang
Royi Rassin
...
Dana Alon
Charles Herrmann
Sjoerd van Steenkiste
Ranjay Krishna
Cyrus Rashtchian
EGVM
160
54
0
29 Nov 2023
Large Language Models Meet Computer Vision: A Brief Survey
Large Language Models Meet Computer Vision: A Brief Survey
Raby Hamadi
LM&MA
123
5
0
28 Nov 2023
SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation
SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic SegmentationComputer Vision and Pattern Recognition (CVPR), 2023
Bin Xie
Jiale Cao
Jin Xie
Fahad Shahbaz Khan
Yanwei Pang
VLM
193
86
0
27 Nov 2023
Florence-2: Advancing a Unified Representation for a Variety of Vision
  Tasks
Florence-2: Advancing a Unified Representation for a Variety of Vision TasksComputer Vision and Pattern Recognition (CVPR), 2023
Bin Xiao
Haiping Wu
Weijian Xu
Xiyang Dai
Houdong Hu
Yumao Lu
Michael Zeng
Ce Liu
Lu Yuan
VLM
297
353
0
10 Nov 2023
Enriching Phrases with Coupled Pixel and Object Contexts for Panoptic
  Narrative Grounding
Enriching Phrases with Coupled Pixel and Object Contexts for Panoptic Narrative GroundingInternational Joint Conference on Artificial Intelligence (IJCAI), 2023
Tianrui Hui
Zihan Ding
Junshi Huang
Xiaoming Wei
Xiaolin K. Wei
Jiao Dai
Jizhong Han
Si Liu
260
7
0
02 Nov 2023
Davidsonian Scene Graph: Improving Reliability in Fine-grained
  Evaluation for Text-to-Image Generation
Davidsonian Scene Graph: Improving Reliability in Fine-grained Evaluation for Text-to-Image GenerationInternational Conference on Learning Representations (ICLR), 2023
Jaemin Cho
Yushi Hu
Roopal Garg
Peter Anderson
Ranjay Krishna
Jason Baldridge
Mohit Bansal
Jordi Pont-Tuset
Su Wang
EGVM
273
118
0
27 Oct 2023
Context Does Matter: End-to-end Panoptic Narrative Grounding with
  Deformable Attention Refined Matching Network
Context Does Matter: End-to-end Panoptic Narrative Grounding with Deformable Attention Refined Matching NetworkIndustrial Conference on Data Mining (IDM), 2023
Yiming Lin
Xiao-Bo Jin
Qiufeng Wang
Kaizhu Huang
137
5
0
25 Oct 2023
Semi-supervised multimodal coreference resolution in image narrations
Semi-supervised multimodal coreference resolution in image narrations
A. Goel
Basura Fernando
Frank Keller
Hakan Bilen
155
6
0
20 Oct 2023
NICE: Improving Panoptic Narrative Detection and Segmentation with
  Cascading Collaborative Learning
NICE: Improving Panoptic Narrative Detection and Segmentation with Cascading Collaborative LearningIEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Haowei Wang
Jiayi Ji
Tianyu Guo
Yilong Yang
Weihao Ye
Xiaoshuai Sun
Rongrong Ji
272
8
0
17 Oct 2023
Reading Books is Great, But Not if You Are Driving! Visually Grounded
  Reasoning about Defeasible Commonsense Norms
Reading Books is Great, But Not if You Are Driving! Visually Grounded Reasoning about Defeasible Commonsense NormsConference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Seungju Han
Junhyeok Kim
Jack Hessel
Liwei Jiang
Jiwan Chung
Yejin Son
Yejin Choi
Youngjae Yu
136
5
0
16 Oct 2023
ALT-Pilot: Autonomous navigation with Language augmented Topometric maps
ALT-Pilot: Autonomous navigation with Language augmented Topometric maps
Mohammad Omama
Pranav Inani
Pranjal Paul
Sarat Chandra Yellapragada
Krishna Murthy Jatavallabhula
Sandeep Chinchali
Madhava Krishna
125
20
0
03 Oct 2023
CLIP Is Also a Good Teacher: A New Learning Framework for Inductive
  Zero-shot Semantic Segmentation
CLIP Is Also a Good Teacher: A New Learning Framework for Inductive Zero-shot Semantic Segmentation
Jialei Chen
Daisuke Deguchi
Chenkai Zhang
Xu Zheng
Hiroshi Murase
VLM
235
9
0
03 Oct 2023
DreamLLM: Synergistic Multimodal Comprehension and Creation
DreamLLM: Synergistic Multimodal Comprehension and CreationInternational Conference on Learning Representations (ICLR), 2023
Runpei Dong
Chunrui Han
Yuang Peng
Zekun Qi
Zheng Ge
...
Hao-Ran Wei
Xiangwen Kong
Xiangyu Zhang
Kaisheng Ma
Li Yi
MLLM
230
271
0
20 Sep 2023
Collecting Visually-Grounded Dialogue with A Game Of Sorts
Collecting Visually-Grounded Dialogue with A Game Of SortsInternational Conference on Language Resources and Evaluation (LREC), 2023
Bram Willemsen
Dmytro Kalpakchi
Gabriel Skantze
81
2
0
10 Sep 2023
A Joint Study of Phrase Grounding and Task Performance in Vision and
  Language Models
A Joint Study of Phrase Grounding and Task Performance in Vision and Language Models
Noriyuki Kojima
Hadar Averbuch-Elor
Yoav Artzi
235
2
0
06 Sep 2023
Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction
  Tuning
Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning
L. Yu
Bowen Shi
Ramakanth Pasunuru
Benjamin Muller
O. Yu. Golovneva
...
Yaniv Taigman
Maryam Fazel-Zarandi
Asli Celikyilmaz
Luke Zettlemoyer
Armen Aghajanyan
MLLM
191
160
0
05 Sep 2023
SCoRD: Subject-Conditional Relation Detection with Text-Augmented Data
SCoRD: Subject-Conditional Relation Detection with Text-Augmented DataIEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2023
Ziyan Yang
Kushal Kafle
Zhe Lin
Scott D. Cohen
Zhihong Ding
Vicente Ordonez
200
1
0
24 Aug 2023
Uni-NLX: Unifying Textual Explanations for Vision and Vision-Language
  Tasks
Uni-NLX: Unifying Textual Explanations for Vision and Vision-Language Tasks
Fawaz Sammani
Nikos Deligiannis
131
6
0
17 Aug 2023
ALIP: Adaptive Language-Image Pre-training with Synthetic Caption
ALIP: Adaptive Language-Image Pre-training with Synthetic CaptionIEEE International Conference on Computer Vision (ICCV), 2023
Kaicheng Yang
Jiankang Deng
Xiang An
Jiawei Li
Ziyong Feng
Jia Guo
Jing Yang
Tongliang Liu
VLMCLIP
178
80
0
16 Aug 2023
VisIT-Bench: A Benchmark for Vision-Language Instruction Following
  Inspired by Real-World Use
VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use
Yonatan Bitton
Hritik Bansal
Jack Hessel
Rulin Shao
Wanrong Zhu
Anas Awadalla
Josh Gardner
Rohan Taori
L. Schimdt
VLM
371
97
0
12 Aug 2023
Taming Self-Training for Open-Vocabulary Object Detection
Taming Self-Training for Open-Vocabulary Object DetectionComputer Vision and Pattern Recognition (CVPR), 2023
Shiyu Zhao
S. Schulter
Long Zhao
Zhixing Zhang
Vijay Kumar B.G
Yumin Suh
Manmohan Chandraker
Dimitris N. Metaxas
VLMObjD
302
19
0
11 Aug 2023
Previous
1234
Next