1505.04870
Cited By
v4 (latest)
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models
19 May 2015
Bryan A. Plummer
Liwei Wang
Christopher M. Cervantes
Juan C. Caicedo
Anjali Narayan-Chen
Svetlana Lazebnik
Papers citing
"Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models"
50 / 1,325 papers shown
V-HUB: A Visual-Centric Humor Understanding Benchmark for Video LLMs
Zhengpeng Shi
Hengli Li
Yanpeng Zhao
Jianqun Zhou
Yuxuan Wang
Qinrong Cui
Wei Bi
Songchun Zhu
Bo Zhao
Zilong Zheng
VLM
122
0
0
30 Sep 2025
MuSLR: Multimodal Symbolic Logical Reasoning
Jundong Xu
Hao Fei
Yuhui Zhang
Liangming Pan
Qijun Huang
...
Preslav Nakov
Min-Yen Kan
William Y. Wang
Mong-Li Lee
Wynne Hsu
ReLM
LRM
130
0
0
30 Sep 2025
Point-It-Out: Benchmarking Embodied Reasoning for Vision Language Models in Multi-Stage Visual Grounding
Haotian Xue
Yunhao Ge
Y. Zeng
Zhaoshuo Li
Ming-Yu Liu
Yongxin Chen
JiaoJiao Fan
141
1
0
30 Sep 2025
OIG-Bench: A Multi-Agent Annotated Benchmark for Multimodal One-Image Guides Understanding
Jiancong Xie
Wenjin Wang
Zhuomeng Zhang
Zihan Liu
Qi Liu
Ke Feng
Zixun Sun
Yuedong Yang
VLM
90
0
0
29 Sep 2025
ColLab: A Collaborative Spatial Progressive Data Engine for Referring Expression Comprehension and Generation
Shilan Zhang
J. Huang
Ruilin Yao
Cong Wang
Yaxiong Chen
Peng Xu
Shengwu Xiong
125
0
0
28 Sep 2025
Geo-R1: Improving Few-Shot Geospatial Referring Expression Understanding with Reinforcement Fine-Tuning
Zilun Zhang
Zian Guan
T. Zhao
H. Shen
Tianyu Li
Yuxiang Cai
Zhonggen Su
Zhaojun Liu
Jianwei Yin
Xiang Li
ObjD
LRM
243
4
0
26 Sep 2025
Deepfakes: we need to re-think the concept of "real" images
J. Keuper
Margret Keuper
139
0
0
26 Sep 2025
Long Story Short: Disentangling Compositionality and Long-Caption Understanding in VLMs
Israfel Salazar
Desmond Elliott
Yova Kementchedjhieva
CoGe
VLM
230
0
0
23 Sep 2025
OmniBridge: Unified Multimodal Understanding, Generation, and Retrieval via Latent Space Alignment
Teng Xiao
Zuchao Li
Lefei Zhang
187
1
0
23 Sep 2025
Speech-to-See: End-to-End Speech-Driven Open-Set Object Detection
Wenhuan Lu
Xinyue Song
Wenjun Ke
Zhizhi Yu
Wenhao Yang
Jianguo Wei
ObjD
96
0
0
20 Sep 2025
RACap: Relation-Aware Prompting for Lightweight Retrieval-Augmented Image Captioning
Xiaosheng Long
Hanyu Wang
Zhentao Song
Kun Luo
Hongde Liu
136
0
0
19 Sep 2025
MaskAttn-SDXL: Controllable Region-Level Text-To-Image Generation
Yu Chang
Jiahao Chen
Anzhe Cheng
Paul Bogdan
DiffM
128
0
0
18 Sep 2025
Efficient Multimodal Dataset Distillation via Generative Models
Zhenghao Zhao
Haoxuan Wang
Junyi Wu
Yuzhang Shang
Gaowen Liu
Yan Yan
DD
287
2
0
18 Sep 2025
MARS2 2025 Challenge on Multimodal Reasoning: Datasets, Methods, Results, Discussion, and Outlook
Peng Xu
Shengwu Xiong
Jiajun Zhang
Yaxiong Chen
Bowen Zhou
...
Yang Yang
Yanglin Deng
Yashu Kang
Ye Yuan
Y. Wen
LRM
127
1
0
17 Sep 2025
Evaluating Robustness of Vision-Language Models Under Noisy Conditions
Purushoth
Alireza
AAML
97
0
0
15 Sep 2025
Towards Understanding Visual Grounding in Visual Language Models
Georgios Pantazopoulos
Eda B. Özyiğit
ObjD
324
3
0
12 Sep 2025
Recurrence Meets Transformers for Universal Multimodal Retrieval
Davide Caffagni
Sara Sarto
Marcella Cornia
Lorenzo Baraldi
Rita Cucchiara
188
1
0
10 Sep 2025
Prototype-Aware Multimodal Alignment for Open-Vocabulary Visual Grounding
Jiangnan Xie
Xiaolong Zheng
Liang Zheng
ObjD
174
0
0
08 Sep 2025
Integrating Spatial and Semantic Embeddings for Stereo Sound Event Localization in Videos
Davide Berghi
Philip J. B. Jackson
111
1
0
08 Sep 2025
Effectively obtaining acoustic, visual and textual data from videos
Jorge E. León
Miguel Carrasco
VGen
139
1
0
06 Sep 2025
Semantic-guided LoRA Parameters Generation
Miaoge Li
Yang Chen
Zhijie Rao
Can Jiang
Jingcai Guo
OffRL
116
0
0
05 Sep 2025
Human Preference-Aligned Concept Customization Benchmark via Decomposed Evaluation
Reina Ishikawa
Ryo Fujii
Hideo Saito
Ryo Hachiuma
146
0
0
03 Sep 2025
EVENT-Retriever: Event-Aware Multimodal Image Retrieval for Realistic Captions
Dinh-Khoi Vo
Van-Loc Nguyen
M. Tran
T. Le
3DV
VGen
66
0
0
31 Aug 2025
VoCap: Video Object Captioning and Segmentation from Any Prompt
J. Uijlings
Xingyi Zhou
Xiuye Gu
Arsha Nagrani
Anurag Arnab
Alireza Fathi
David A. Ross
Cordelia Schmid
VOS
VLM
261
1
0
29 Aug 2025
Sparse and Dense Retrievers Learn Better Together: Joint Sparse-Dense Optimization for Text-Image Retrieval
Jonghyun Song
Youngjune Lee
Gyu-Hwung Cho
Ilhyeon Song
Saehun Kim
Yohan Jo
VLM
88
0
0
22 Aug 2025
RAGSR: Regional Attention Guided Diffusion for Image Super-Resolution
Haodong He
Y. Bai
Rui Lan
Xu Duan
Lei Sun
Xiangxiang Chu
Gui-Song Xia
DiffM
126
1
0
22 Aug 2025
Towards Open World Detection: A Survey
Andrei-Stefan Bulzan
Cosmin Cernazanu-Glavan
ObjD
VLM
220
0
0
22 Aug 2025
Ouroboros: Single-step Diffusion Models for Cycle-consistent Forward and Inverse Rendering
Shanlin Sun
Yifan Wang
Hanwen Zhang
Yifeng Xiong
Qin Ren
Ruogu Fang
Xiaohui Xie
Chenyu You
174
4
0
20 Aug 2025
Understanding Data Influence with Differential Approximation
Haoru Tan
Sitong Wu
Xiuzhe Wu
Wang Wang
Bo Zhao
Zeke Xie
Gui-Song Xia
Xiaojuan Qi
TDI
283
1
0
20 Aug 2025
7Bench: a Comprehensive Benchmark for Layout-guided Text-to-image Models
Elena Izzo
Luca Parolari
Davide Vezzaro
Lamberto Ballan
111
0
0
18 Aug 2025
Region-Level Context-Aware Multimodal Understanding
Hongliang Wei
Xianqi Zhang
Xingtao Wang
Xiaopeng Fan
Debin Zhao
VLM
165
0
0
17 Aug 2025
Logic Unseen: Revealing the Logical Blindspots of Vision-Language Models
Yuchen Zhou
Jiayu Tang
Shuo Yang
Xiaoyan Xiao
Yuqin Dai
Wenhao Yang
Chao Gou
Xiaobo Xia
Tat-Seng Chua
VLM
CoGe
LRM
145
2
0
15 Aug 2025
JRDB-Reasoning: A Difficulty-Graded Benchmark for Visual Reasoning in Robotics
Simindokht Jahangard
Mehrzad Mohammadi
Yi Shen
Zhixi Cai
Hamid Rezatofighi
294
2
0
14 Aug 2025
Bridging Modality Gaps in e-Commerce Products via Vision-Language Alignment
Yipeng Zhang
Hongju Yu
Aritra Mandal
Canran Xu
Qunzhi Zhou
Zhe Wu
192
0
0
13 Aug 2025
DocThinker: Explainable Multimodal Large Language Models with Rule-based Reinforcement Learning for Document Understanding
Wenwen Yu
Zhibo Yang
Yuliang Liu
Xiang Bai
MLLM
OffRL
LRM
95
4
0
12 Aug 2025
ExpVG: Investigating the Design Space of Visual Grounding in Multimodal Large Language Model
Weitai Kang
Weiming Zhuang
Zhizhong Li
Yan Yan
Lingjuan Lyu
126
1
0
11 Aug 2025
MCITlib: Multimodal Continual Instruction Tuning Library and Benchmark
Haiyang Guo
Fei Zhu
Hongbo Zhao
Fanhu Zeng
Wenzhuo Liu
Shijie Ma
Da-Han Wang
Xu-Yao Zhang
CLL
214
2
0
10 Aug 2025
SIFThinker: Spatially-Aware Image Focus for Visual Reasoning
Zhangquan Chen
Ruihui Zhao
Chuwei Luo
Mingze Sun
Xinlei Yu
Yangyang Kang
Ruqi Huang
LRM
287
4
0
08 Aug 2025
Adapting Vision-Language Models Without Labels: A Comprehensive Survey
Hao Dong
Lijun Sheng
Jian Liang
Ran He
Eleni Chatzi
Olga Fink
OffRL
VLM
219
4
0
07 Aug 2025
Dual Prompt Learning for Adapting Vision-Language Models to Downstream Image-Text Retrieval
Y. Wang
Tao Wang
Chenwei Tang
Caiyang Yu
Zhengqing Zang
Mengmi Zhang
Shudong Huang
Jiancheng Lv
VLM
115
0
0
06 Aug 2025
ChartCap: Mitigating Hallucination of Dense Chart Captioning
Junyoung Lim
Jaewoo Ahn
Gunhee Kim
128
2
0
05 Aug 2025
VITRIX-CLIPIN: Enhancing Fine-Grained Visual Understanding in CLIP via Instruction Editing Data and Long Captions
Ziteng Wang
Siqi Yang
Limeng Qiao
Lin Ma
VLM
397
0
0
04 Aug 2025
Context-Adaptive Multi-Prompt Embedding with Large Language Models for Vision-Language Alignment
Dahun Kim
A. Angelova
VLM
232
1
0
03 Aug 2025
Eigen Neural Network: Unlocking Generalizable Vision with Eigenbasis
Anzhe Cheng
Chenzhong Yin
Mingxi Cheng
Shukai Duan
Shahin Nazarian
Paul Bogdan
225
0
0
02 Aug 2025
Instruction-Grounded Visual Projectors for Continual Learning of Generative Vision-Language Models
Hyundong Jin
Hyung Jin Chang
Eunwoo Kim
VLM
142
0
0
01 Aug 2025
SPRINT: Scalable and Predictive Intent Refinement for LLM-Enhanced Session-based Recommendation
G. G. Lee
Y. Liu
Yifan Liu
Susik Yoon
Dong Wang
SeongKu Kang
191
2
0
01 Aug 2025
Improving Multimodal Contrastive Learning of Sentence Embeddings with Object-Phrase Alignment
Kaiyan Zhao
Zhongtao Miao
Yoshimasa Tsuruoka
102
1
0
01 Aug 2025
Multimodal Referring Segmentation: A Survey
Henghui Ding
Song Tang
Shuting He
Chang-rui Liu
Zuxuan Wu
Yu-Gang Jiang
394
11
0
01 Aug 2025
Bidirectional Likelihood Estimation with Multi-Modal Large Language Models for Text-Video Retrieval
Dohwan Ko
Ji Soo Lee
M. Choi
Zihang Meng
Hyunwoo J. Kim
384
1
0
31 Jul 2025
On the Reliability of Vision-Language Models Under Adversarial Frequency-Domain Perturbations
Jordan Vice
Naveed Akhtar
Yansong Gao
Richard Hartley
Ajmal Mian
AAML
210
2
0
30 Jul 2025