1912.03098
Connecting Vision and Language with Localized Narratives
European Conference on Computer Vision (ECCV), 2020
6 December 2019
Jordi Pont-Tuset
J. Uijlings
Soravit Changpinyo
Radu Soricut
V. Ferrari
ObjD
Papers citing "Connecting Vision and Language with Localized Narratives"
50 / 199 papers shown
SGDiff: Scene Graph Guided Diffusion Model for Image Collaborative SegCaptioning
AAAI Conference on Artificial Intelligence (AAAI), 2025
Xu Zhang
Jin Yuan
Hanwang Zhang
Guojin Zhong
Yongsheng Zang
Jiacheng Lin
Zhiyong Li
DiffM
VLM
64
1
0
01 Dec 2025
LocateAnything3D: Vision-Language 3D Detection with Chain-of-Sight
Yunze Man
S. S. Wang
Guowen Zhang
Johan Bjorck
Zhiqi Li
Liang-Yan Gui
Jim Fan
Jan Kautz
Yu Wang
Zhiding Yu
113
0
0
25 Nov 2025
NVIDIA Nemotron Nano V2 VL
Nvidia
Amala Sanjay Deshmukh
Kateryna Chumachenko
Tuomas Rintamaki
Matthieu Le
...
Krzysztof Pawelec
Michael Evans
Katherine Luna
Jie Lou
Erick Galinkin
VLM
284
1
0
06 Nov 2025
VAMOS: A Hierarchical Vision-Language-Action Model for Capability-Modulated and Steerable Navigation
Mateo Guaman Castro
Sidharth Rajagopal
Daniel Gorbatov
Matt Schmittle
R. Baijal
...
Sidharth Talia
Emma Romig
Celso de Melo
Byron Boots
Abhishek Gupta
LM&Ro
115
0
0
23 Oct 2025
FineVision: Open Data Is All You Need
Luis Wiedmann
Orr Zohar
Amir Mahla
Xiaohan Wang
Rui Li
Thibaud Frere
Leandro von Werra
Aritra Roy Gosthipaty
Andrés Marafioti
VLM
180
12
0
20 Oct 2025
BitMar: Low-Bit Multimodal Fusion with Episodic Memory for Edge Devices
Euhid Aman
Esteban Carlin
Hsing-Kuo Pao
Giovanni Beltrame
Ghaluh Indah Permata Sari
Yie-Tarng Chen
100
0
0
12 Oct 2025
Looking to Learn: Token-wise Dynamic Gating for Low-Resource Vision-Language Modelling
Bianca-Mihaela Ganescu
Suchir Salhan
Andrew Caines
P. Buttery
VLM
92
0
0
09 Oct 2025
One Patch to Caption Them All: A Unified Zero-Shot Captioning Framework
Lorenzo Bianchi
Giacomo Pacini
F. Carrara
Nicola Messina
Giuseppe Amato
Fabrizio Falchi
VLM
138
0
0
03 Oct 2025
Model Merging to Maintain Language-Only Performance in Developmentally Plausible Multimodal Models
Ece Takmaz
Lisa Bylinina
Jakub Dotlacil
MoMe
156
0
0
02 Oct 2025
VELA: An LLM-Hybrid-as-a-Judge Approach for Evaluating Long Image Captions
Kazuki Matsuda
Yuiga Wada
Shinnosuke Hirano
Seitaro Otsuki
Komei Sugiura
VLM
120
0
0
30 Sep 2025
Long Story Short: Disentangling Compositionality and Long-Caption Understanding in VLMs
Israfel Salazar
Desmond Elliott
Yova Kementchedjhieva
CoGe
VLM
187
0
0
23 Sep 2025
MAJORScore: A Novel Metric for Evaluating Multimodal Relevance via Joint Representation
Zhicheng Du
Qingyang Shi
J. Lu
Yingshan Liang
Xinyu Zhang
Y. Wang
Peiwu Qin
87
0
0
22 Sep 2025
Case-Based Decision-Theoretic Decoding with Quality Memories
Hiroyuki Deguchi
Masaaki Nagata
141
0
0
16 Sep 2025
Towards Understanding Visual Grounding in Visual Language Models
Georgios Pantazopoulos
Eda B. Özyiğit
ObjD
272
3
0
12 Sep 2025
VoCap: Video Object Captioning and Segmentation from Any Prompt
J. Uijlings
Xingyi Zhou
Xiuye Gu
Arsha Nagrani
Anurag Arnab
Alireza Fathi
David A. Ross
Cordelia Schmid
VOS
VLM
228
1
0
29 Aug 2025
OVFact: Measuring and Improving Open-Vocabulary Factuality for Long Caption Models
Monika Wysoczańska
Shyamal Buch
Anurag Arnab
Cordelia Schmid
HILM
160
0
0
25 Jul 2025
LOTUS: A Leaderboard for Detailed Image Captioning from Quality to Societal Bias and User Preferences
Yusuke Hirota
Boyi Li
Ryo Hachiuma
Yueh-Hua Wu
Boris Ivanovic
Yuta Nakashima
Marco Pavone
Yejin Choi
Yu-Chun Wang
Chao-Han Huck Yang
VLM
179
1
0
25 Jul 2025
AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning
Yiming Ren
Zhiqiang Lin
Yu Li
Gao Meng
Weiyun Wang
...
Zicheng Lin
Jifeng Dai
Yujiu Yang
Wenhai Wang
Ruihang Chu
152
3
0
17 Jul 2025
FOCUS: Unified Vision-Language Modeling for Interactive Editing Driven by Referential Segmentation
Fan Yang
Yousong Zhu
Xin Li
Yufei Zhan
Hongyin Zhao
Shurong Zheng
Yaowei Wang
Ming Tang
Jinqiao Wang
MLLM
VLM
210
0
0
20 Jun 2025
HalLoc: Token-level Localization of Hallucinations for Vision Language Models
Computer Vision and Pattern Recognition (CVPR), 2025
Eunkyu Park
Minyeong Kim
Gunhee Kim
MLLM
HILM
VLM
331
2
0
12 Jun 2025
ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs
Xiyao Wang
Zhengyuan Yang
Chao Feng
Yongyuan Liang
Yuhang Zhou
...
Chung-Ching Lin
Kevin Lin
Linjie Li
Furong Huang
Lijuan Wang
OffRL
LRM
278
8
0
11 Jun 2025
MINT: Multimodal Instruction Tuning with Multimodal Interaction Grouping
Xiaojun Shan
Qi Cao
Xing Han
Haofei Yu
Paul Liang
248
1
0
02 Jun 2025
ContextQFormer: A New Context Modeling Method for Multi-Turn Multi-Modal Conversations
Yiming Lei
Zhizheng Yang
Zeming Liu
Haitao Leng
Shaoguo Liu
Tingting Gao
Qingjie Liu
Yunhong Wang
227
0
0
29 May 2025
Instructify: Demystifying Metadata to Visual Instruction Tuning Data Conversion
Jacob A. Hansen
Wei Lin
Junmo Kang
M. Jehanzeb Mirza
Hongyin Luo
Rogerio Feris
Alan Ritter
James R. Glass
Leonid Karlinsky
VLM
406
1
0
23 May 2025
DetailMaster: Can Your Text-to-Image Model Handle Long Prompts?
Qirui Jiao
Daoyuan Chen
Yilun Huang
Xika Lin
Ying Shen
Yaliang Li
VLM
194
1
0
22 May 2025
DPSeg: Dual-Prompt Cost Volume Learning for Open-Vocabulary Semantic Segmentation
Computer Vision and Pattern Recognition (CVPR), 2025
Ziyu Zhao
Xiaoguang Li
Linjia Shi
Nasrin Imanpour
Song Wang
VLM
194
2
0
16 May 2025
Learning Graph Representation of Agent Diffusers
Adaptive Agents and Multi-Agent Systems (AAMAS), 2025
Youcef Djenouri
Nassim Belmecheri
Tomasz Michalak
Jan Dubiński
Ahmed Nabil Belbachir
Anis Yazidi
AI4CE
443
0
0
10 May 2025
FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding
Zheng Liu
Mengjie Liu
Jianfei Chen
Jingwei Xu
Tengjiao Wang
Bin Wang
Wentao Zhang
MLLM
433
1
0
14 Apr 2025
Latent Beam Diffusion Models for Generating Visual Sequences
Guilherme Fernandes
Vasco Ramos
Regev Cohen
Idan Szpektor
João Magalhães
320
1
0
26 Mar 2025
MLLM-Selector: Necessity and Diversity-driven High-Value Data Selection for Enhanced Visual Instruction Tuning
Yiwei Ma
Guohai Xu
Xiaoshuai Sun
Jinfa Huang
Jie Lou
Debing Zhang
Rongrong Ji
447
6
0
26 Mar 2025
StreamMind: Unlocking Full Frame Rate Streaming Video Dialogue through Event-Gated Cognition
Xin Ding
Hao Wu
Yue Yang
Shiqi Jiang
Donglin Bai
Zhibo Chen
Ting Cao
893
9
0
08 Mar 2025
M2-omni: Advancing Omni-MLLM for Comprehensive Modality Support with Competitive Performance
Qingpei Guo
Kaiyou Song
Zipeng Feng
Ziping Ma
Qinglong Zhang
...
Yunxiao Sun
Tai-Wei Chang
Jingdong Chen
Ming Yang
Jun Zhou
MLLM
VLM
522
12
0
26 Feb 2025
BabyLM Turns 3: Call for papers for the 2025 BabyLM workshop
Lucas Charpentier
Leshem Choshen
Robert Bamler
Mustafa Omer Gul
Michael Y. Hu
...
Candace Ross
Raj Sanjay Shah
Alex Warstadt
Ethan Gotlieb Wilcox
Adina Williams
278
25
0
15 Feb 2025
Composite Sketch+Text Queries for Retrieving Objects with Elusive Names and Complex Interactions
AAAI Conference on Artificial Intelligence (AAAI), 2024
Prajwal Gatti
Kshitij Parikh
Dhriti Prasanna Paul
Manish Gupta
Anand Mishra
468
4
0
12 Feb 2025
DrawEduMath: Evaluating Vision Language Models with Expert-Annotated Students' Hand-Drawn Math Images
North American Chapter of the Association for Computational Linguistics (NAACL), 2025
Sami Baral
L. Lucy
Ryan Knight
Alice Ng
Luca Soldaini
Neil T. Heffernan
Kyle Lo
268
10
0
28 Jan 2025
Findings of the Second BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora
Michael Y. Hu
Aaron Mueller
Candace Ross
Adina Williams
Tal Linzen
Chengxu Zhuang
Robert Bamler
Leshem Choshen
Alex Warstadt
Ethan Gotlieb Wilcox
447
36
0
06 Dec 2024
BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices
Computer Vision and Pattern Recognition (CVPR), 2024
Xudong Lu
Yinghao Chen
Cheng Chen
Hui Tan
Boheng Chen
...
Aojun Zhou
Yafei Wen
Xiaoxin Chen
Shuai Ren
Jiaming Song
165
19
0
16 Nov 2024
SimpsonsVQA: Enhancing Inquiry-Based Learning with a Tailored Dataset
Ngoc Dung Huynh
Mohamed Reda Bouadjenek
Sunil Aryal
Imran Razzak
Hakim Hacid
217
0
0
30 Oct 2024
Offline Evaluation of Set-Based Text-to-Image Generation
Negar Arabzadeh
Fernando Diaz
Junfeng He
EGVM
184
1
0
22 Oct 2024
Exploring Curriculum Learning for Vision-Language Tasks: A Study on Small-Scale Multimodal Training
Rohan Saha
Abrar Fahim
Alona Fyshe
Alex Murphy
163
3
0
20 Oct 2024
Compositional Entailment Learning for Hyperbolic Vision-Language Models
International Conference on Learning Representations (ICLR), 2024
Avik Pal
Max van Spengler
Guido Maria D'Amely di Melendugno
Alessandro Flaborea
Fabio Galasso
Pascal Mettes
CoGe
329
31
0
09 Oct 2024
Dynamic Prompting of Frozen Text-to-Image Diffusion Models for Panoptic Narrative Grounding
ACM Multimedia (MM), 2024
Hongyu Li
Tianrui Hui
Zihan Ding
Jing Zhang
Bin Ma
Xiaoming Wei
Jizhong Han
Si Liu
DiffM
193
4
0
12 Sep 2024
PiTe: Pixel-Temporal Alignment for Large Video-Language Model
European Conference on Computer Vision (ECCV), 2024
Yang Liu
Pengxiang Ding
Siteng Huang
Min Zhang
Han Zhao
Donglin Wang
182
8
0
11 Sep 2024
Building and better understanding vision-language models: insights and future directions
Hugo Laurençon
Andrés Marafioti
Victor Sanh
Léo Tronchon
VLM
282
129
0
22 Aug 2024
TraDiffusion: Trajectory-Based Training-Free Image Generation
Mingrui Wu
Oucheng Huang
Jiayi Ji
Jiale Li
Xinyue Cai
Huafeng Kuang
Jianzhuang Liu
Xiaoshuai Sun
Rongrong Ji
187
4
0
19 Aug 2024
Look Hear: Gaze Prediction for Speech-directed Human Attention
European Conference on Computer Vision (ECCV), 2024
Sounak Mondal
Seoyoung Ahn
Zhibo Yang
Niranjan Balasubramanian
Dimitris Samaras
G. Zelinsky
Minh Hoai
389
3
0
28 Jul 2024
Position: Measure Dataset Diversity, Don't Just Claim It
Dora Zhao
Jerone T. A. Andrews
Orestis Papakyriakopoulos
Alice Xiang
267
28
0
11 Jul 2024
Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions
Yu-Guan Hsieh
Cheng-Yu Hsieh
Shih-Ying Yeh
Louis Béthune
Hadi Pour Ansari
Pavan Kumar Anasosalu Vasu
Chun-Liang Li
Ranjay Krishna
Oncel Tuzel
Marco Cuturi
337
7
0
09 Jul 2024
A Single Transformer for Scalable Vision-Language Modeling
Yangyi Chen
Xingyao Wang
Yuan Yao
Heng Ji
LRM
265
28
0
08 Jul 2024
UltraEdit: Instruction-based Fine-Grained Image Editing at Scale
Haozhe Zhao
Xiaojian Ma
Liang Chen
Shuzheng Si
Rujie Wu
Kaikai An
Peiyu Yu
Minjia Zhang
Qing Li
Baobao Chang
348
157
0
07 Jul 2024