1505.04870
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models
19 May 2015
Bryan A. Plummer
Liwei Wang
Christopher M. Cervantes
Juan C. Caicedo
Anjali Narayan-Chen
Svetlana Lazebnik
Papers citing
"Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models"
50 / 1,318 papers shown
Title
ELIP: Enhanced Visual-Language Foundation Models for Image Retrieval
Guanqi Zhan
Yuanpei Liu
Kai Han
Weidi Xie
Andrew Zisserman
VLM
1.0K
0
0
21 Feb 2025
Enhancing Adversarial Robustness of Vision-Language Models through Low-Rank Adaptation
International Conference on Multimedia Retrieval (ICMR), 2024
Yuheng Ji
Yue Liu
Zhicheng Zhang
Zhao Zhang
Yuting Zhao
Gang Zhou
Xingwei Zhang
Xinwang Liu
Xiaolong Zheng
VLM
334
4
0
21 Feb 2025
LOVA3: Learning to Visual Question Answering, Asking and Assessment
Neural Information Processing Systems (NeurIPS), 2024
Henry Hengyuan Zhao
Pan Zhou
Difei Gao
Zechen Bai
Mike Zheng Shou
358
13
0
21 Feb 2025
Contrastive Localized Language-Image Pre-Training
Hong-You Chen
Zhengfeng Lai
Hao Zhang
Xiang Wang
Marcin Eichner
Keen You
Meng Cao
Bowen Zhang
Yue Yang
Zhe Gan
CLIP
VLM
271
22
0
20 Feb 2025
InterFeedback: Unveiling Interactive Intelligence of Large Multimodal Models via Human Feedback
Henry Hengyuan Zhao
Wenqi Pei
Yifei Tao
Haiyang Mei
Mike Zheng Shou
388
0
0
20 Feb 2025
Megrez-Omni Technical Report
Boxun Li
Yadong Li
Hui Yuan
Congyi Liu
Weilin Liu
...
Dong Zhou
Yueqing Zhuang
Shengen Yan
Guohao Dai
Longji Xu
179
1
0
19 Feb 2025
HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation
L. Yang
Xinchen Zhang
Ye Tian
Chenming Shang
Minghao Xu
Wentao Zhang
Tengjiao Wang
296
9
0
17 Feb 2025
How Blind and Low-Vision Individuals Prefer Large Vision-Language Model-Generated Scene Descriptions
Na Min An
Eunki Kim
Wan Ju Kang
Sangryul Kim
Hyunjung Shim
240
2
0
15 Feb 2025
Fine-tuning Multimodal Transformers on Edge: A Parallel Split Learning Approach
Timo Fudala
Vasileios Tsouvalas
N. Meratnia
MoE
204
0
0
10 Feb 2025
Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion
International Conference on Learning Representations (ICLR), 2025
Marco Mistretta
Alberto Baldrati
Lorenzo Agnolucci
Marco Bertini
Andrew D. Bagdanov
CLIP
VLM
387
14
0
06 Feb 2025
Robust-LLaVA: On the Effectiveness of Large-Scale Robust Image Encoders for Multi-modal Large Language Models
H. Malik
Fahad Shamshad
Muzammal Naseer
Karthik Nandakumar
Fahad Shahbaz Khan
Salman Khan
AAML
MLLM
VLM
378
5
0
03 Feb 2025
LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models
Computer Vision and Pattern Recognition (CVPR), 2025
Shenghao Fu
Q. Yang
Qijie Mo
Junkai Yan
Xihan Wei
Jingke Meng
Xiaohua Xie
Wei-Shi Zheng
MLLM
ObjD
VLM
335
27
0
31 Jan 2025
Fine Tuning without Catastrophic Forgetting via Selective Low Rank Adaptation
Reza Akbarian Bafghi
Carden Bagwell
Avinash Ravichandran
Ashish Shrivastava
M. Raissi
204
4
0
28 Jan 2025
Grounding Text-to-Image Diffusion Models for Controlled High-Quality Image Generation
Ahmad Süleyman
Göksel Biricik
343
3
0
15 Jan 2025
OneLLM: One Framework to Align All Modalities with Language
Computer Vision and Pattern Recognition (CVPR), 2023
Jiaming Han
Kaixiong Gong
Yiyuan Zhang
Yuan Liu
Kaipeng Zhang
Dahua Lin
Yu Qiao
Shiyang Feng
Xiangyu Yue
MLLM
484
188
0
10 Jan 2025
Classifier-Guided Captioning Across Modalities
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2025
Ariel Shaulov
Tal Shaharabany
E. Shaar
Gal Chechik
Lior Wolf
185
0
0
03 Jan 2025
Hierarchical Alignment-enhanced Adaptive Grounding Network for Generalized Referring Expression Comprehension
AAAI Conference on Artificial Intelligence (AAAI), 2025
Yaxian Wang
Henghui Ding
Shuting He
Xudong Jiang
Bifan Wei
Jun Liu
ObjD
205
7
0
03 Jan 2025
Unleashing Text-to-Image Diffusion Prior for Zero-Shot Image Captioning
European Conference on Computer Vision (ECCV), 2024
Jianjie Luo
Jingwen Chen
Yehao Li
Yingwei Pan
Jianlin Feng
Hongyang Chao
Ting Yao
DiffM
VLM
225
1
0
03 Jan 2025
VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks
International Conference on Learning Representations (ICLR), 2024
Ziyan Jiang
Rui Meng
Xinyi Yang
Semih Yavuz
Yingbo Zhou
Lei Ma
MLLM
VLM
452
89
0
03 Jan 2025
VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks
Neural Information Processing Systems (NeurIPS), 2024
Jiannan Wu
Muyan Zhong
Sen Xing
Zeqiang Lai
Zhaoyang Liu
...
Lewei Lu
Tong Lu
Ping Luo
Yu Qiao
Jifeng Dai
MLLM
VLM
LRM
647
113
0
03 Jan 2025
Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
Neural Information Processing Systems (NeurIPS), 2024
Hao Fei
Shengqiong Wu
Hao Zhang
Tat-Seng Chua
Shuicheng Yan
407
70
0
31 Dec 2024
ErgoChat: a Visual Query System for the Ergonomic Risk Assessment of Construction Workers
Chao Fan
Qipei Mei
Xiaonan Wang
Xinming Li
134
4
0
31 Dec 2024
Towards Visual Grounding: A Survey
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024
Linhui Xiao
Xiaoshan Yang
X. Lan
Yaowei Wang
Changsheng Xu
ObjD
739
26
0
28 Dec 2024
To Predict or Not To Predict? Proportionally Masked Autoencoders for Tabular Data Imputation
Jungkyu Kim
Kibok Lee
Taeyoung Park
296
3
0
26 Dec 2024
GME: Improving Universal Multimodal Retrieval by Multimodal LLMs
Xin Zhang
Yanzhao Zhang
Wen Xie
Mingxin Li
Ziqi Dai
Dingkun Long
Pengjun Xie
Meishan Zhang
Wenjie Li
Hao Fei
374
65
0
22 Dec 2024
DINOv2 Meets Text: A Unified Framework for Image- and Pixel-Level Vision-Language Alignment
Computer Vision and Pattern Recognition (CVPR), 2024
Cijo Jose
Théo Moutakanni
Dahyun Kang
Federico Baldassarre
Timothée Darcet
...
Maxime Oquab
Oriane Siméoni
Huy V. Vo
Patrick Labatut
Piotr Bojanowski
CLIP
VLM
268
30
0
20 Dec 2024
Bag of Tricks for Multimodal AutoML with Image, Text, and Tabular Data
Zhiqiang Tang
Zihan Zhong
Tong He
Gerald Friedland
323
4
0
19 Dec 2024
I0T: Embedding Standardization Method Towards Zero Modality Gap
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Na Min An
Eunki Kim
James Thorne
Hyunjung Shim
VLM
297
2
0
18 Dec 2024
LLaVA-UHD v2: an MLLM Integrating High-Resolution Semantic Pyramid via Hierarchical Window Transformer
Yipeng Zhang
Yi Liu
Zonghao Guo
Yidan Zhang
Xuesong Yang
...
Xingtai Lv
Zhiyuan Liu
Tat-Seng Chua
Maosong Sun
MLLM
VLM
291
3
0
18 Dec 2024
M^3-VOS: Multi-Phase, Multi-Transition, and Multi-Scenery Video Object Segmentation
Computer Vision and Pattern Recognition (CVPR), 2024
Zixuan Chen
Jiaxin Li
Liming Tan
Yejie Guo
Junxuan Liang
Cewu Lu
Yongqian Li
VOS
321
0
0
18 Dec 2024
FLAIR: VLM with Fine-grained Language-informed Image Representations
Computer Vision and Pattern Recognition (CVPR), 2024
Rui Xiao
Sanghwan Kim
Mariana-Iuliana Georgescu
Zeynep Akata
Stephan Alaniz
VLM
CLIP
268
17
0
04 Dec 2024
DIR: Retrieval-Augmented Image Captioning with Comprehensive Understanding
Hao Wu
Zhihang Zhong
Xiao Sun
DiffM
198
1
0
02 Dec 2024
CIA: Controllable Image Augmentation Framework Based on Stable Diffusion
Conference on Multimedia Information Processing and Retrieval (MIPR), 2024
Mohamed Benkedadra
Dany Rimez
Tiffanie Godelaine
Natarajan Chidambaram
Hamed Razavi Khosroshahi
Horacio Tellez
Matei Mancas
Benoît Macq
Sidi Ahmed Mahmoudi
DiffM
206
2
0
25 Nov 2024
IterIS: Iterative Inference-Solving Alignment for LoRA Merging
Computer Vision and Pattern Recognition (CVPR), 2024
Hongxu Chen
Runshi Li
Bowei Zhu
Zhen Wang
Long Chen
MoMe
344
4
0
21 Nov 2024
AI-generated Image Detection: Passive or Watermark?
Moyang Guo
Yuepeng Hu
Zhengyuan Jiang
Zeyu Li
Amir Sadovnik
Arka Daw
Neil Zhenqiang Gong
392
2
0
20 Nov 2024
Joint Vision-Language Social Bias Removal for CLIP
Computer Vision and Pattern Recognition (CVPR), 2024
Haoyu Zhang
Yangyang Guo
Mohan S. Kankanhalli
VLM
375
9
0
19 Nov 2024
SoK: The Security-Safety Continuum of Multimodal Foundation Models through Information Flow and Global Game-Theoretic Analysis of Asymmetric Threats
Ruoxi Sun
Jiamin Chang
Hammond Pearce
Chaowei Xiao
B. Li
Qi Wu
Surya Nepal
Minhui Xue
577
0
0
17 Nov 2024
Llama Guard 3 Vision: Safeguarding Human-AI Image Understanding Conversations
Jianfeng Chi
Ujjwal Karn
Hongyuan Zhan
Eric Michael Smith
Javier Rando
Yiming Zhang
Kate Plawiak
Zacharie Delpierre Coudert
Kartikeya Upasani
Mahesh Pasupuleti
MLLM
3DH
222
75
0
15 Nov 2024
Advancing Fine-Grained Visual Understanding with Multi-Scale Alignment in Multi-Modal Models
Wei Wang
Hao Sun
Qi Xu
Linfeng Li
Yiqing Cai
Botian Jiang
Hang Song
Xingcan Hu
Pengyu Wang
Li Xiao
144
7
0
14 Nov 2024
AD-DINO: Attention-Dynamic DINO for Distance-Aware Embodied Reference Understanding
Hao Guo
Wei Fan
Baichun Wei
Jianfei Zhu
Jin Tian
Chunzhi Yi
Feng Jiang
201
0
0
13 Nov 2024
No Culture Left Behind: ArtELingo-28, a Benchmark of WikiArt with Captions in 28 Languages
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Youssef Mohamed
Runjia Li
Ibrahim Said Ahmad
Kilichbek Haydarov
Juil Sock
Kenneth Church
Mohamed Elhoseiny
VLM
155
15
0
06 Nov 2024
HumanVLM: Foundation for Human-Scene Vision-Language Model
Information Fusion (Inf. Fusion), 2024
Dawei Dai
Xu Long
Li Yutang
Zhang YuanHui
Shuyin Xia
VLM
MLLM
275
6
0
05 Nov 2024
Semantic-Aligned Adversarial Evolution Triangle for High-Transferability Vision-Language Attack
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024
Yang Liu
Sensen Gao
Qing Guo
Ke Ma
Yihao Huang
Simeng Qin
Yang Liu
Ivor Tsang
Xiaochun Cao
AAML
167
7
0
04 Nov 2024
TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives
Neural Information Processing Systems (NeurIPS), 2024
Maitreya Patel
Abhiram Kusumba
Sheng Cheng
Changhoon Kim
Tejas Gokhale
Chitta Baral
Yezhou Yang
CoGe
CLIP
251
34
0
04 Nov 2024
ReferEverything: Towards Segmenting Everything We Can Speak of in Videos
Anurag Bagchi
Zhipeng Bao
Yu-Xiong Wang
P. Tokmakov
Martial Hebert
VOS
215
2
0
30 Oct 2024
Preserving Pre-trained Representation Space: On Effectiveness of Prefix-tuning for Large Multi-modal Models
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Donghoon Kim
Gusang Lee
Kyuhong Shim
B. Shim
242
5
0
29 Oct 2024
ChatSearch: a Dataset and a Generative Retrieval Model for General Conversational Image Retrieval
Pattern Recognition (Pattern Recogn.), 2024
Zijia Zhao
Longteng Guo
Tongtian Yue
Erdong Hu
Shuai Shao
Zehuan Yuan
Hua Huang
Qingbin Liu
133
4
0
24 Oct 2024
Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance
Zhangwei Gao
Zhe Chen
Erfei Cui
Yiming Ren
Weiyun Wang
...
Lewei Lu
Tong Lu
Yu Qiao
Jifeng Dai
Wenhai Wang
VLM
319
82
0
21 Oct 2024
Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models
Yufei Zhan
Hongyin Zhao
Yousong Zhu
Fan Yang
Ming Tang
Jinqiao Wang
MLLM
219
3
0
21 Oct 2024
Test-time Adaptation for Cross-modal Retrieval with Query Shift
Haobin Li
Peng Hu
Qianjun Zhang
Xi Peng
Xiting Liu
Mouxing Yang
TTA
244
8
0
21 Oct 2024