Communities
Connect sessions
AI calendar
Organizations
Join Slack
Contact Sales
Search
Open menu
Home
Papers
1505.04870
Cited By
v1
v2
v3
v4 (latest)
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models
19 May 2015
Bryan A. Plummer
Liwei Wang
Christopher M. Cervantes
Juan C. Caicedo
Anjali Narayan-Chen
Svetlana Lazebnik
Re-assign community
ArXiv (abs)
PDF
HTML
Papers citing
"Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models"
50 / 1,325 papers shown
Fine-tuning CLIP Text Encoders with Two-step Paraphrasing
Hyunjae Kim
Seunghyun Yoon
Trung Bui
Handong Zhao
Quan Tran
Franck Dernoncourt
Jaewoo Kang
CLIP
238
4
0
23 Feb 2024
Uncertainty-Aware Evaluation for Vision-Language Models
Vasily Kostumov
Bulat Nutfullin
Oleg Pilipenko
Eugene Ilyushin
ELM
440
16
0
22 Feb 2024
CounterCurate: Enhancing Physical and Semantic Visio-Linguistic Compositional Reasoning via Counterfactual Examples
Jianrui Zhang
Mu Cai
Tengyang Xie
Yong Jae Lee
LRM
390
33
0
20 Feb 2024
A Survey on Knowledge Distillation of Large Language Models
Xiaohan Xu
Ming Li
Chongyang Tao
Tao Shen
Reynold Cheng
Jinyang Li
Can Xu
Dacheng Tao
Wanrong Zhu
KELM
VLM
464
235
0
20 Feb 2024
Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions
Akash Ghosh
Arkadeep Acharya
Sriparna Saha
Vinija Jain
Vasu Sharma
VLM
542
68
0
20 Feb 2024
Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models
Christian Schlarmann
Naman D. Singh
Francesco Croce
Matthias Hein
VLM
AAML
389
86
0
19 Feb 2024
CIC: A Framework for Culturally-Aware Image Captioning
Youngsik Yun
Jihie Kim
VLM
416
10
0
08 Feb 2024
Enhancing Multimodal Large Language Models with Vision Detection Models: An Empirical Study
Qirui Jiao
Daoyuan Chen
Yilun Huang
Yaliang Li
Ying Shen
189
12
0
31 Jan 2024
Proximity QA: Unleashing the Power of Multi-Modal Large Language Models for Spatial Proximity Analysis
Jianing Li
Xi Nan
Ming Lu
Li Du
Shanghang Zhang
148
5
0
31 Jan 2024
YOLO-World: Real-Time Open-Vocabulary Object Detection
Tianheng Cheng
Lin Song
Yixiao Ge
Wenyu Liu
Xinggang Wang
Ying Shan
VLM
ObjD
404
650
0
30 Jan 2024
PACE: A Pragmatic Agent for Enhancing Communication Efficiency Using Large Language Models
Jiaxuan Li
Minxi Yang
Dahua Gao
Wenlong Xu
Guangming Shi
188
0
0
30 Jan 2024
M2-Encoder: Advancing Bilingual Image-Text Understanding by Large-scale Efficient Pretraining
Qingpei Guo
Furong Xu
Hanxiao Zhang
Wang Ren
Ziping Ma
Lin Ju
Jian Wang
Jingdong Chen
Ming Yang
VLM
MLLM
220
6
0
29 Jan 2024
CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning
International Joint Conference on Artificial Intelligence (IJCAI), 2024
Xue Sun
Xinya Wu
Pengfei Zhou
Richeng Xuan
Guang Liu
Xi Yang
Qiannan Zhu
Hua Huang
ELM
LRM
297
28
0
25 Jan 2024
SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Siwei Wu
Yi Zhou
Kang Zhu
Ge Zhang
Yiming Liang
...
Lei Ma
Wenhao Huang
Noura Al Moubayed
Jie Fu
Chenghua Lin
244
23
0
24 Jan 2024
ChatterBox: Multi-round Multimodal Referring and Grounding
Yunjie Tian
Tianren Ma
Lingxi Xie
Jihao Qiu
Xi Tang
Yuan Zhang
Jianbin Jiao
Qi Tian
Qixiang Ye
213
23
0
24 Jan 2024
Prompting Large Vision-Language Models for Compositional Reasoning
Timothy Ossowski
Ming Jiang
Junjie Hu
CoGe
VLM
LRM
235
7
0
20 Jan 2024
Supervised Fine-tuning in turn Improves Visual Foundation Models
Xiaohu Jiang
Yixiao Ge
Yuying Ge
Dachuan Shi
Chun Yuan
Ying Shan
VLM
CLIP
250
14
0
18 Jan 2024
MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer
Changyao Tian
Xizhou Zhu
Yuwen Xiong
Weiyun Wang
Zhe Chen
...
Tong Lu
Jie Zhou
Jiaming Song
Yu Qiao
Jifeng Dai
AuLLM
240
70
0
18 Jan 2024
Improving fine-grained understanding in image-text pre-training
Ioana Bica
Anastasija Ilić
Matthias Bauer
Goker Erdogan
Matko Bovsnjak
...
A. Gritsenko
Matthias Minderer
Charles Blundell
Razvan Pascanu
Jovana Mitrović
VLM
220
45
0
18 Jan 2024
COCO is "ALL'' You Need for Visual Instruction Fine-tuning
IEEE International Conference on Multimedia and Expo (ICME), 2024
Xiaotian Han
Yiqi Wang
Bohan Zhai
Quanzeng You
Hongxia Yang
VLM
MLLM
210
3
0
17 Jan 2024
KTVIC: A Vietnamese Image Captioning Dataset on the Life Domain
Anh-Cuong Pham
Van-Quang Nguyen
Thi-Hong Vuong
Quang-Thuy Ha
265
2
0
16 Jan 2024
GroundingGPT:Language Enhanced Multi-modal Grounding Model
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Zhaowei Li
Qi Xu
Dong Zhang
Hang Song
Yiqing Cai
...
Junting Pan
Zefeng Li
Van Tu Vu
Zhida Huang
Tao Wang
621
95
0
11 Jan 2024
Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection
Wei Ye
Chaoya Jiang
Haiyang Xu
Chenhao Ye
Chenliang Li
Mingshi Yan
Shikun Zhang
Songhang Huang
Fei Huang
VLM
191
1
0
11 Jan 2024
Exploring the Reasoning Abilities of Multimodal Large Language Models (MLLMs): A Comprehensive Survey on Emerging Trends in Multimodal Reasoning
Yiqi Wang
Wentao Chen
Xiaotian Han
Xudong Lin
Haiteng Zhao
Yongfei Liu
Bohan Zhai
Jianbo Yuan
Quanzeng You
Hongxia Yang
LRM
315
146
0
10 Jan 2024
CaMML: Context-Aware Multimodal Learner for Large Models
Yixin Chen
Shuai Zhang
Boran Han
Tong He
Bo Li
VLM
277
6
0
06 Jan 2024
Object-oriented backdoor attack against image captioning
Meiling Li
Nan Zhong
Xinpeng Zhang
Zhenxing Qian
Sheng Li
173
10
0
05 Jan 2024
Towards Weakly Supervised Text-to-Audio Grounding
Xuenan Xu
Ziyang Ma
Mengyue Wu
Kai Yu
AI4TS
356
17
0
05 Jan 2024
An Open and Comprehensive Pipeline for Unified Object Grounding and Detection
Xiangyu Zhao
Yicheng Chen
Shilin Xu
Xiangtai Li
Xinjiang Wang
Yining Li
Haian Huang
ObjD
AI4CE
309
55
0
04 Jan 2024
Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training
Longtian Qiu
Shan Ning
Xuming He
VLM
212
12
0
04 Jan 2024
SyCoCa: Symmetrizing Contrastive Captioners with Attentive Masking for Multimodal Alignment
International Conference on Machine Learning (ICML), 2024
Ziping Ma
Furong Xu
Jian Liu
Ming Yang
Qingpei Guo
VLM
233
7
0
04 Jan 2024
GPT-4V(ision) is a Generalist Web Agent, if Grounded
International Conference on Machine Learning (ICML), 2024
Boyuan Zheng
Boyu Gou
Jihyung Kil
Huan Sun
Yu-Chuan Su
MLLM
VLM
LLMAG
385
407
0
03 Jan 2024
BEV-TSR: Text-Scene Retrieval in BEV Space for Autonomous Driving
AAAI Conference on Artificial Intelligence (AAAI), 2024
Tao Tang
Dafeng Wei
Zhengyu Jia
Tian Gao
Changwei Cai
...
Yixing Zhao
Han Li
Xiaodan Liang
Xianpeng Lang
Yang Wang
217
16
0
02 Jan 2024
COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training
Alex Jinpeng Wang
Linjie Li
Kevin Qinghong Lin
Jianfeng Wang
Kevin Lin
Zhengyuan Yang
Lijuan Wang
Mike Zheng Shou
VLM
VGen
208
15
0
01 Jan 2024
Generating Enhanced Negatives for Training Language-Based Object Detectors
Computer Vision and Pattern Recognition (CVPR), 2023
Shiyu Zhao
Long Zhao
Vijay Kumar B.G
Yumin Suh
Dimitris N. Metaxas
Manmohan Chandraker
S. Schulter
ObjD
VLM
453
13
0
29 Dec 2023
Bridging Modality Gap for Visual Grounding with Effecitve Cross-modal Distillation
Chinese Conference on Pattern Recognition and Computer Vision (CPRCV), 2023
Jiaxi Wang
Wenhui Hu
Xueyang Liu
Beihu Wu
Yuting Qiu
Yingying Cai
281
1
0
29 Dec 2023
Video Understanding with Large Language Models: A Survey
Yunlong Tang
Jing Bi
Siting Xu
Luchuan Song
Susan Liang
...
Feng Zheng
Jianguo Zhang
Chenliang Xu
Jiebo Luo
Chenliang Xu
VLM
717
170
0
29 Dec 2023
MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile Devices
Xiangxiang Chu
Limeng Qiao
Xinyang Lin
Shuang Xu
Yang Yang
...
Fei Wei
Xinyu Zhang
Bo Zhang
Xiaolin Wei
Chunhua Shen
MLLM
312
70
0
28 Dec 2023
Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey
Jiaxing Huang
Jingyi Zhang
Kai Jiang
Han Qiu
Shijian Lu
198
30
0
27 Dec 2023
Cycle-Consistency Learning for Captioning and Grounding
Ning Wang
Jiajun Deng
Mingbo Jia
ObjD
235
13
0
23 Dec 2023
GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection
Haozhan Shen
Tiancheng Zhao
Mingwei Zhu
Yuxiang Cai
VLM
ObjD
420
26
0
22 Dec 2023
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
Zhe Chen
Jiannan Wu
Wenhai Wang
Weijie Su
Guo Chen
...
Bin Li
Ping Luo
Tong Lu
Yu Qiao
Jifeng Dai
VLM
MLLM
641
2,182
0
21 Dec 2023
Misalign, Contrast then Distill: Rethinking Misalignments in Language-Image Pretraining
Bumsoo Kim
Yeonsik Jo
Jinhyung Kim
S. Kim
VLM
255
10
0
19 Dec 2023
Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model
Shraman Pramanick
Guangxing Han
Rui Hou
Sayan Nag
Ser-Nam Lim
Nicolas Ballas
Qifan Wang
Rama Chellappa
Amjad Almahairi
VLM
MLLM
386
50
0
19 Dec 2023
Context Disentangling and Prototype Inheriting for Robust Visual Grounding
Wei Tang
Liang Li
Xuejing Liu
Lu Jin
Jinhui Tang
Zechao Li
271
41
0
19 Dec 2023
Open3DIS: Open-Vocabulary 3D Instance Segmentation with 2D Mask Guidance
P. Nguyen
T.D. Ngo
E. Kalogerakis
Chuang Gan
Anh Tran
Cuong Pham
Khoi Duc Minh Nguyen
ISeg
474
101
0
17 Dec 2023
Pixel Aligned Language Models
Computer Vision and Pattern Recognition (CVPR), 2023
Jiarui Xu
Xingyi Zhou
Shen Yan
Xiuye Gu
Anurag Arnab
Chen Sun
Xiaolong Wang
Cordelia Schmid
MLLM
VLM
291
17
0
14 Dec 2023
TiMix: Text-aware Image Mixing for Effective Vision-Language Pre-training
AAAI Conference on Artificial Intelligence (AAAI), 2023
Chaoya Jiang
Wei Ye
Haiyang Xu
Qinghao Ye
Mingshi Yan
Ji Zhang
Shikun Zhang
CLIP
VLM
326
6
0
14 Dec 2023
Exploration of visual prompt in Grounded pre-trained open-set detection
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023
Qibo Chen
Weizhong Jin
Shuchang Li
Mengdi Liu
Li Yu
Jian Jiang
Xiaozheng Wang
VLM
121
1
0
14 Dec 2023
Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator
Henry Hengyuan Zhao
Pan Zhou
Mike Zheng Shou
MLLM
SyDa
460
12
0
11 Dec 2023
MAFA: Managing False Negatives for Vision-Language Pre-training
Jaeseok Byun
Dohoon Kim
Taesup Moon
VLM
414
13
0
11 Dec 2023
Previous
1
2
3
...
9
10
11
...
25
26
27
Next
Page 10 of 27
Page
of 27
Go