Communities
Connect sessions
AI calendar
Organizations
Join Slack
Contact Sales
Search
Open menu
Home
Papers
1505.04870
Cited By
v1
v2
v3
v4 (latest)
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models
19 May 2015
Bryan A. Plummer
Liwei Wang
Christopher M. Cervantes
Juan C. Caicedo
Anjali Narayan-Chen
Svetlana Lazebnik
Re-assign community
ArXiv (abs)
PDF
HTML
Papers citing
"Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models"
50 / 1,322 papers shown
Title
ChatSearch: a Dataset and a Generative Retrieval Model for General Conversational Image Retrieval
Pattern Recognition (Pattern Recogn.), 2024
Zijia Zhao
Longteng Guo
Tongtian Yue
Erdong Hu
Shuai Shao
Zehuan Yuan
Hua Huang
Qingbin Liu
145
4
0
24 Oct 2024
Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance
Zhangwei Gao
Zhe Chen
Erfei Cui
Yiming Ren
Weiyun Wang
...
Lewei Lu
Tong Lu
Yu Qiao
Jifeng Dai
Wenhai Wang
VLM
367
84
0
21 Oct 2024
Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models
Yufei Zhan
Hongyin Zhao
Yousong Zhu
Fan Yang
Ming Tang
Jinqiao Wang
MLLM
267
3
0
21 Oct 2024
Test-time Adaptation for Cross-modal Retrieval with Query Shift
Haobin Li
Peng Hu
Qianjun Zhang
Xi Peng
Xiting Liu
Mouxing Yang
TTA
260
8
0
21 Oct 2024
NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples
Neural Information Processing Systems (NeurIPS), 2024
Baiqi Li
Zhiqiu Lin
Wenxuan Peng
Jean de Dieu Nyandwi
Daniel Jiang
Zixian Ma
Simran Khanuja
Ranjay Krishna
Graham Neubig
Deva Ramanan
AAML
CoGe
VLM
596
59
0
18 Oct 2024
Can MLLMs Understand the Deep Implication Behind Chinese Images?
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Chenhao Zhang
Xi Feng
Yuelin Bai
Xinrun Du
Jinchang Hou
...
Min Yang
Wenhao Huang
Chenghua Lin
Ge Zhang
Shiwen Ni
ELM
VLM
136
9
0
17 Oct 2024
LocateBench: Evaluating the Locating Ability of Vision Language Models
Ting-Rui Chiang
Joshua Robinson
Xinyan Velocity Yu
Dani Yogatama
VLM
ELM
212
0
0
17 Oct 2024
CMAL: A Novel Cross-Modal Associative Learning Framework for Vision-Language Pre-Training
ACM Multimedia (ACM MM), 2022
Zhiyuan Ma
Jianjun Li
Guohui Li
Kaiyan Huang
VLM
345
9
0
16 Oct 2024
CtrlSynth: Controllable Image Text Synthesis for Data-Efficient Multimodal Learning
Qingqing Cao
Mahyar Najibi
Sachin Mehta
CLIP
DiffM
221
1
0
15 Oct 2024
Efficient and Effective Universal Adversarial Attack against Vision-Language Pre-training Models
Fan Yang
Yihao Huang
Kaidi Wang
Ling Shi
G. Pu
Yang Liu
Jian Shu
AAML
VLM
213
2
0
15 Oct 2024
TULIP: Token-length Upgraded CLIP
International Conference on Learning Representations (ICLR), 2024
Ivona Najdenkoska
Mohammad Mahdi Derakhshani
Yuki M. Asano
Nanne van Noord
Marcel Worring
Cees G. M. Snoek
VLM
377
13
0
13 Oct 2024
Towards Efficient Visual-Language Alignment of the Q-Former for Visual Reasoning Tasks
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Sungkyung Kim
Adam Lee
Junyoung Park
Andrew Chung
Jusang Oh
Jay-Yoon Lee
96
9
0
12 Oct 2024
OneRef: Unified One-tower Expression Grounding and Segmentation with Mask Referring Modeling
Neural Information Processing Systems (NeurIPS), 2024
Linhui Xiao
Xiaoshan Yang
Fang Peng
Yaowei Wang
Changsheng Xu
ObjD
406
20
0
10 Oct 2024
Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate
Qidong Huang
Xiaoyi Dong
Pan Zhang
Yuhang Zang
Yuhang Cao
Jiaqi Wang
Dahua Lin
Weiming Zhang
Nenghai Yu
157
20
0
09 Oct 2024
ING-VP: MLLMs cannot Play Easy Vision-based Games Yet
Haoran Zhang
Hangyu Guo
Shuyue Guo
Meng Cao
Wenhao Huang
Jiaheng Liu
Ge Zhang
VLM
MLLM
LRM
212
4
0
09 Oct 2024
Compositional Entailment Learning for Hyperbolic Vision-Language Models
International Conference on Learning Representations (ICLR), 2024
Avik Pal
Max van Spengler
Guido Maria DÁmely di Melendugno
Alessandro Flaborea
Fabio Galasso
Pascal Mettes
CoGe
329
31
0
09 Oct 2024
VISTA: A Visual and Textual Attention Dataset for Interpreting Multimodal Models
Harshit
Tolga Tasdizen
CoGe
VLM
146
1
0
06 Oct 2024
CoVLM: Leveraging Consensus from Vision-Language Models for Semi-supervised Multi-modal Fake News Detection
Asian Conference on Computer Vision (ACCV), 2024
Devank
Jayateja Kalla
Soma Biswas
146
5
0
06 Oct 2024
AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark
International Conference on Learning Representations (ICLR), 2024
Wenhao Chai
Enxin Song
Y. Du
Chenlin Meng
Vashisht Madhavan
Omer Bar-Tal
Jeng-Neng Hwang
Saining Xie
Christopher D. Manning
3DV
605
89
0
04 Oct 2024
MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning
Haotian Zhang
Mingfei Gao
Zhe Gan
Philipp Dufter
Nina Wenzel
...
Haoxuan You
Zirui Wang
Afshin Dehghan
Peter Grasch
Yinfei Yang
VLM
MLLM
287
64
1
30 Sep 2024
HELPD: Mitigating Hallucination of LVLMs by Hierarchical Feedback Learning with Vision-enhanced Penalty Decoding
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Fan Yuan
Chi Qin
Xiaogang Xu
Piji Li
VLM
MLLM
145
9
0
30 Sep 2024
Harnessing Frozen Unimodal Encoders for Flexible Multimodal Alignment
Computer Vision and Pattern Recognition (CVPR), 2024
Mayug Maniparambil
Raiymbek Akshulakov
Y. A. D. Djilali
Sanath Narayan
Ankit Singh
Noel E. O'Connor
VLM
MLLM
132
2
0
28 Sep 2024
SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion
Neural Information Processing Systems (NeurIPS), 2024
Ming Dai
Lingfeng Yang
Yihao Xu
Zhenhua Feng
Wankou Yang
ObjD
414
37
0
26 Sep 2024
MIO: A Foundation Model on Multimodal Tokens
Zekun Wang
King Zhu
Chunpu Xu
Wangchunshu Zhou
Jiaheng Liu
...
Yuanxing Zhang
Ge Zhang
Ke Xu
Jie Fu
Wenhao Huang
MLLM
AuLLM
414
20
0
26 Sep 2024
A-VL: Adaptive Attention for Large Vision-Language Models
AAAI Conference on Artificial Intelligence (AAAI), 2024
Junyang Zhang
Mu Yuan
Ruiguang Zhong
Puhan Luo
Huiyou Zhan
Ningkang Zhang
Chengchen Hu
Xiangyang Li
VLM
329
4
0
23 Sep 2024
Instruction-guided Multi-Granularity Segmentation and Captioning with Large Multimodal Model
Li Zhou
Xu Yuan
Zenghui Sun
Zikun Zhou
Jingsong Lan
VLM
MLLM
836
7
0
20 Sep 2024
HiFi-CS: Towards Open Vocabulary Visual Grounding For Robotic Grasping Using Vision-Language Models
V. Bhat
Prashanth Krishnamurthy
Ramesh Karri
Farshad Khorrami
429
9
0
16 Sep 2024
NEVLP: Noise-Robust Framework for Efficient Vision-Language Pre-training
Yiyi Tao
Zhuoyue Wang
Hang Zhang
Lun Wang
VLM
295
26
0
15 Sep 2024
Automatic Scene Generation: State-of-the-Art Techniques, Models, Datasets, Challenges, and Future Prospects
IEEE Access (IEEE Access), 2024
Awal Ahmed Fime
Saifuddin Mahmud
Arpita Das
Md. Sunzidul Islam
Hong-Hoon Kim
VGen
3DV
223
2
0
14 Sep 2024
Mamba-YOLO-World: Marrying YOLO-World with Mamba for Open-Vocabulary Detection
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024
Haoxuan Wang
Qu He
Jinlong Peng
Hao Yang
Mingmin Chi
Yabiao Wang
Mamba
229
7
0
13 Sep 2024
ComAlign: Compositional Alignment in Vision-Language Models
Ali Abdollah
Amirmohammad Izadi
Armin Saghafian
Reza Vahidimajd
Mohammad Mozafari
Amirreza Mirzaei
Mohammadmahdi Samiei
M. Baghshah
CoGe
VLM
186
1
0
12 Sep 2024
An Attribute-Enriched Dataset and Auto-Annotated Pipeline for Open Detection
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024
Pengfei Qi
Yifei Zhang
Wenqiang Li
Youwen Hu
Kunlong Bai
ObjD
183
0
0
10 Sep 2024
Make Graph-based Referring Expression Comprehension Great Again through Expression-guided Dynamic Gating and Regression
IEEE transactions on multimedia (IEEE TMM), 2024
Jingcheng Ke
Dele Wang
Jun-Cheng Chen
I-Hong Jhuo
Chia-Wen Lin
Yen-Yu Lin
218
1
0
05 Sep 2024
No Detail Left Behind: Revisiting Self-Retrieval for Fine-Grained Image Captioning
Manu Gaur
Darshan Singh
Makarand Tapaswi
883
2
0
04 Sep 2024
Pre-Training Multimodal Hallucination Detectors with Corrupted Grounding Data
Spencer Whitehead
Jacob Phillips
Sean Hendryx
135
0
0
30 Aug 2024
See or Guess: Counterfactually Regularized Image Captioning
ACM Multimedia (MM), 2024
Qian Cao
Xu Chen
Ruihua Song
Xiting Wang
Xinting Huang
Yuchen Ren
CML
174
3
0
29 Aug 2024
ResVG: Enhancing Relation and Semantic Understanding in Multiple Instances for Visual Grounding
ACM Multimedia (MM), 2024
Minghang Zheng
Jiahua Zhang
Qingchao Chen
Yuxin Peng
Yang Liu
ObjD
262
5
0
29 Aug 2024
Rethinking Sparse Lexical Representations for Image Retrieval in the Age of Rising Multi-Modal Large Language Models
K. Nakata
Daisuke Miyashita
Youyang Ng
Yasuto Hoshi
J. Deguchi
125
0
0
29 Aug 2024
Pixels to Prose: Understanding the art of Image Captioning
Hrishikesh Singh
Aarti Sharma
Millie Pant
3DV
VLM
190
2
0
28 Aug 2024
Evaluating Attribute Comprehension in Large Vision-Language Models
Chinese Conference on Pattern Recognition and Computer Vision (CPRCV), 2024
Haiwen Zhang
Zixi Yang
Yuanzhi Liu
Xinran Wang
Zheqi He
Kongming Liang
Zhanyu Ma
ELM
170
0
0
25 Aug 2024
Tangram: A Challenging Benchmark for Geometric Element Recognizing
Jiamin Tang
Chao Zhang
Xudong Zhu
Mengchi Liu
LRM
51
1
0
25 Aug 2024
IAA: Inner-Adaptor Architecture Empowers Frozen Large Language Model with Multimodal Capabilities
AAAI Conference on Artificial Intelligence (AAAI), 2024
Bin Wang
Chunyu Xie
Dawei Leng
Yuhui Yin
MLLM
425
6
0
23 Aug 2024
Towards Deconfounded Image-Text Matching with Causal Inference
ACM Multimedia (ACM MM), 2023
Wenhui Li
Xinqi Su
Dan Song
Lanjun Wang
Kun Zhang
An-An Liu
BDL
CML
196
15
0
22 Aug 2024
RT-OVAD: Real-Time Open-Vocabulary Aerial Object Detection via Image-Text Collaboration
Guoting Wei
Xia Yuan
Yu Liu
Zhenhao Shang
Kelu Yao
Peng Wang
Kelu Yao
Chunxia Zhao
Haokui Zhang
Rong Xiao
ObjD
VLM
563
0
0
22 Aug 2024
A Survey on Integrated Sensing, Communication, and Computation
IEEE Communications Surveys and Tutorials (COMST), 2024
Dingzhu Wen
Yong Zhou
Xiaoyang Li
Yuanming Shi
Kaibin Huang
Khaled B. Letaief
216
109
0
15 Aug 2024
Can Large Language Models Understand Symbolic Graphics Programs?
International Conference on Learning Representations (ICLR), 2024
Zeju Qiu
Weiyang Liu
Haiwen Feng
Zhen Liu
Tim Z. Xiao
Katherine M. Collins
J. Tenenbaum
Adrian Weller
Michael J. Black
Bernhard Schölkopf
563
27
0
15 Aug 2024
How Well Can Vision Language Models See Image Details?
Chenhui Gou
Abdulwahab Felemban
Faizan Farooq Khan
Deyao Zhu
Jianfei Cai
Hamid Rezatofighi
Mohamed Elhoseiny
VLM
MLLM
212
12
0
07 Aug 2024
Fairness and Bias Mitigation in Computer Vision: A Survey
Sepehr Dehdashtian
Ruozhen He
Yi Li
Guha Balakrishnan
Nuno Vasconcelos
Vicente Ordonez
Vishnu Boddeti
337
11
0
05 Aug 2024
An Efficient and Effective Transformer Decoder-Based Framework for Multi-Task Visual Grounding
European Conference on Computer Vision (ECCV), 2024
Wei Chen
Mahdieh Hatamian
Yu Wu
202
16
0
02 Aug 2024
MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts
Xi Lin
Akshat Shrivastava
Liang Luo
Srinivasan Iyer
Mike Lewis
Gargi Gosh
Luke Zettlemoyer
Armen Aghajanyan
MoE
231
50
0
31 Jul 2024
Previous
1
2
3
...
5
6
7
...
25
26
27
Next