1505.04870
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models
19 May 2015
Bryan A. Plummer
Liwei Wang
Christopher M. Cervantes
Juan C. Caicedo
Anjali Narayan-Chen
Svetlana Lazebnik
Papers citing
"Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models"
50 / 1,318 papers shown
Title
Deepfakes: we need to re-think the concept of "real" images
J. Keuper
Margret Keuper
90
0
0
26 Sep 2025
OmniBridge: Unified Multimodal Understanding, Generation, and Retrieval via Latent Space Alignment
Teng Xiao
Zuchao Li
Lefei Zhang
121
0
0
23 Sep 2025
Long Story Short: Disentangling Compositionality and Long-Caption Understanding in VLMs
Israfel Salazar
Desmond Elliott
Yova Kementchedjhieva
CoGe
VLM
151
0
0
23 Sep 2025
Speech-to-See: End-to-End Speech-Driven Open-Set Object Detection
Wenhuan Lu
Xinyue Song
Wenjun Ke
Zhizhi Yu
Wenhao Yang
Jianguo Wei
ObjD
76
0
0
20 Sep 2025
RACap: Relation-Aware Prompting for Lightweight Retrieval-Augmented Image Captioning
Xiaosheng Long
Hanyu Wang
Zhentao Song
Kun Luo
Hongde Liu
84
0
0
19 Sep 2025
MaskAttn-SDXL: Controllable Region-Level Text-To-Image Generation
Yu Chang
Jiahao Chen
Anzhe Cheng
Paul Bogdan
DiffM
61
0
0
18 Sep 2025
Efficient Multimodal Dataset Distillation via Generative Models
Zhenghao Zhao
Haoxuan Wang
Junyi Wu
Yuzhang Shang
Gaowen Liu
Yan Yan
DD
211
0
0
18 Sep 2025
MARS2 2025 Challenge on Multimodal Reasoning: Datasets, Methods, Results, Discussion, and Outlook
Peng Xu
Shengwu Xiong
Jiajun Zhang
Yaxiong Chen
Bowen Zhou
...
Yang Yang
Yanglin Deng
Yashu Kang
Ye Yuan
Y. Wen
LRM
91
1
0
17 Sep 2025
Evaluating Robustness of Vision-Language Models Under Noisy Conditions
Purushoth
Alireza
AAML
76
0
0
15 Sep 2025
Towards Understanding Visual Grounding in Visual Language Models
Georgios Pantazopoulos
Eda B. Özyiğit
ObjD
236
1
0
12 Sep 2025
Recurrence Meets Transformers for Universal Multimodal Retrieval
Davide Caffagni
Sara Sarto
Marcella Cornia
Lorenzo Baraldi
Rita Cucchiara
120
1
0
10 Sep 2025
Prototype-Aware Multimodal Alignment for Open-Vocabulary Visual Grounding
Jiangnan Xie
Xiaolong Zheng
Liang Zheng
ObjD
129
0
0
08 Sep 2025
Integrating Spatial and Semantic Embeddings for Stereo Sound Event Localization in Videos
Davide Berghi
Philip J. B. Jackson
68
0
0
08 Sep 2025
Effectively obtaining acoustic, visual and textual data from videos
Jorge E. León
Miguel Carrasco
VGen
111
1
0
06 Sep 2025
Semantic-guided LoRA Parameters Generation
Miaoge Li
Yang Chen
Zhijie Rao
Can Jiang
Jingcai Guo
OffRL
80
0
0
05 Sep 2025
Human Preference-Aligned Concept Customization Benchmark via Decomposed Evaluation
Reina Ishikawa
Ryo Fujii
Hideo Saito
Ryo Hachiuma
108
0
0
03 Sep 2025
EVENT-Retriever: Event-Aware Multimodal Image Retrieval for Realistic Captions
Dinh-Khoi Vo
Van-Loc Nguyen
M. Tran
T. Le
3DV
VGen
44
0
0
31 Aug 2025
VoCap: Video Object Captioning and Segmentation from Any Prompt
J. Uijlings
Xingyi Zhou
Xiuye Gu
Arsha Nagrani
Anurag Arnab
Alireza Fathi
David A. Ross
Cordelia Schmid
VOS
VLM
188
1
0
29 Aug 2025
Sparse and Dense Retrievers Learn Better Together: Joint Sparse-Dense Optimization for Text-Image Retrieval
Jonghyun Song
Youngjune Lee
Gyu-Hwung Cho
Ilhyeon Song
Saehun Kim
Yohan Jo
VLM
52
0
0
22 Aug 2025
RAGSR: Regional Attention Guided Diffusion for Image Super-Resolution
Haodong He
Y. Bai
Rui Lan
Xu Duan
Lei Sun
Xiangxiang Chu
Gui-Song Xia
DiffM
70
1
0
22 Aug 2025
Towards Open World Detection: A Survey
Andrei-Stefan Bulzan
Cosmin Cernazanu-Glavan
ObjD
VLM
159
0
0
22 Aug 2025
Ouroboros: Single-step Diffusion Models for Cycle-consistent Forward and Inverse Rendering
Shanlin Sun
Yifan Wang
Hanwen Zhang
Yifeng Xiong
Qin Ren
Ruogu Fang
Xiaohui Xie
Chenyu You
126
2
0
20 Aug 2025
Understanding Data Influence with Differential Approximation
Haoru Tan
Sitong Wu
Xiuzhe Wu
Wang Wang
Bo Zhao
Zeke Xie
Gui-Song Xia
Xiaojuan Qi
TDI
206
1
0
20 Aug 2025
7Bench: a Comprehensive Benchmark for Layout-guided Text-to-image Models
Elena Izzo
Luca Parolari
Davide Vezzaro
Lamberto Ballan
52
0
0
18 Aug 2025
Region-Level Context-Aware Multimodal Understanding
Hongliang Wei
Xianqi Zhang
Xingtao Wang
Xiaopeng Fan
Debin Zhao
VLM
125
0
0
17 Aug 2025
Logic Unseen: Revealing the Logical Blindspots of Vision-Language Models
Yuchen Zhou
Jiayu Tang
Shuo Yang
Xiaoyan Xiao
Yuqin Dai
Wenhao Yang
Chao Gou
Xiaobo Xia
Tat-Seng Chua
VLM
CoGe
LRM
109
1
0
15 Aug 2025
JRDB-Reasoning: A Difficulty-Graded Benchmark for Visual Reasoning in Robotics
Simindokht Jahangard
Mehrzad Mohammadi
Yi Shen
Zhixi Cai
Hamid Rezatofighi
225
1
0
14 Aug 2025
Bridging Modality Gaps in e-Commerce Products via Vision-Language Alignment
Yipeng Zhang
Hongju Yu
Aritra Mandal
Canran Xu
Qunzhi Zhou
Zhe Wu
132
0
0
13 Aug 2025
DocThinker: Explainable Multimodal Large Language Models with Rule-based Reinforcement Learning for Document Understanding
Wenwen Yu
Zhibo Yang
Yuliang Liu
Xiang Bai
MLLM
OffRL
LRM
64
3
0
12 Aug 2025
ExpVG: Investigating the Design Space of Visual Grounding in Multimodal Large Language Model
Weitai Kang
Weiming Zhuang
Zhizhong Li
Yan Yan
Lingjuan Lyu
82
0
0
11 Aug 2025
MCITlib: Multimodal Continual Instruction Tuning Library and Benchmark
Haiyang Guo
Fei Zhu
Hongbo Zhao
Fanhu Zeng
Wenzhuo Liu
Shijie Ma
Da-Han Wang
Xu-Yao Zhang
CLL
166
2
0
10 Aug 2025
SIFThinker: Spatially-Aware Image Focus for Visual Reasoning
Zhangquan Chen
Ruihui Zhao
Chuwei Luo
Mingze Sun
Xinlei Yu
Yangyang Kang
Ruqi Huang
LRM
177
4
0
08 Aug 2025
Adapting Vision-Language Models Without Labels: A Comprehensive Survey
Hao Dong
Lijun Sheng
Jian Liang
Ran He
Eleni Chatzi
Olga Fink
OffRL
VLM
164
3
0
07 Aug 2025
Dual Prompt Learning for Adapting Vision-Language Models to Downstream Image-Text Retrieval
Y. Wang
Tao Wang
Chenwei Tang
Caiyang Yu
Zhengqing Zang
Mengmi Zhang
Shudong Huang
Jiancheng Lv
VLM
87
0
0
06 Aug 2025
ChartCap: Mitigating Hallucination of Dense Chart Captioning
Junyoung Lim
Jaewoo Ahn
Gunhee Kim
88
1
0
05 Aug 2025
VITRIX-CLIPIN: Enhancing Fine-Grained Visual Understanding in CLIP via Instruction Editing Data and Long Captions
Ziteng Wang
Siqi Yang
Limeng Qiao
Lin Ma
VLM
221
0
0
04 Aug 2025
Context-Adaptive Multi-Prompt Embedding with Large Language Models for Vision-Language Alignment
Dahun Kim
A. Angelova
VLM
163
0
0
03 Aug 2025
Eigen Neural Network: Unlocking Generalizable Vision with Eigenbasis
Anzhe Cheng
Chenzhong Yin
Mingxi Cheng
Shukai Duan
Shahin Nazarian
Paul Bogdan
176
0
0
02 Aug 2025
Session-Based Recommendation with Validated and Enriched LLM Intents
G. G. Lee
Y. Liu
Yifan Liu
Susik Yoon
Dong Wang
SeongKu Kang
155
2
0
01 Aug 2025
Multimodal Referring Segmentation: A Survey
Henghui Ding
Song Tang
Shuting He
Chang-rui Liu
Zuxuan Wu
Yu-Gang Jiang
306
10
0
01 Aug 2025
Improving Multimodal Contrastive Learning of Sentence Embeddings with Object-Phrase Alignment
Kaiyan Zhao
Zhongtao Miao
Yoshimasa Tsuruoka
74
1
0
01 Aug 2025
Instruction-Grounded Visual Projectors for Continual Learning of Generative Vision-Language Models
Hyundong Jin
Hyung Jin Chang
Eunwoo Kim
VLM
85
0
0
01 Aug 2025
Bidirectional Likelihood Estimation with Multi-Modal Large Language Models for Text-Video Retrieval
Dohwan Ko
Ji Soo Lee
M. Choi
Zihang Meng
Hyunwoo J. Kim
236
1
0
31 Jul 2025
On the Reliability of Vision-Language Models Under Adversarial Frequency-Domain Perturbations
Jordan Vice
Naveed Akhtar
Yansong Gao
Richard Hartley
Ajmal Mian
AAML
155
1
0
30 Jul 2025
Trade-offs in Image Generation: How Do Different Dimensions Interact?
Sicheng Zhang
Binzhu Xie
Zhonghao Yan
Yuli Zhang
Donghao Zhou
Xiaofei Chen
Shi Qiu
Jiaqi Liu
Guoyang Xie
Zhichao Lu
115
2
0
29 Jul 2025
MMAT-1M: A Large Reasoning Dataset for Multimodal Agent Tuning
Tianhong Gao
Yannian Fu
Weiqun Wu
Haixiao Yue
Shanshan Liu
Gang Zhang
MLLM
LRM
169
1
0
29 Jul 2025
On The Role of Pretrained Language Models in General-Purpose Text Embeddings: A Survey
Meishan Zhang
Xin Zhang
X. Zhao
Shouzheng Huang
Baotian Hu
Min Zhang
169
3
0
28 Jul 2025
ZSE-Cap: A Zero-Shot Ensemble for Image Retrieval and Prompt-Guided Captioning
Duc-Tai Dinh
Duc Anh Khoa Dinh
VLM
56
0
0
28 Jul 2025
Causality-aligned Prompt Learning via Diffusion-based Counterfactual Generation
Xinshu Li
Ruoyu Wang
Erdun Gao
Mingming Gong
Lina Yao
DiffM
127
0
0
26 Jul 2025
Dynamic-DINO: Fine-Grained Mixture of Experts Tuning for Real-time Open-Vocabulary Object Detection
Yehao Lu
Minghe Weng
Zekang Xiao
Rui Jiang
Wei Su
Guangcong Zheng
Ping Lu
Xi Li
MoE
ObjD
116
0
0
23 Jul 2025