1505.04870
Cited By
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models
19 May 2015
Bryan A. Plummer
Liwei Wang
Christopher M. Cervantes
Juan C. Caicedo
Anjali Narayan-Chen
Svetlana Lazebnik
Papers citing
"Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models"
50 / 1,325 papers shown
MetaReVision: Meta-Learning with Retrieval for Visually Grounded Compositional Concept Acquisition
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Guangyue Xu
Parisa Kordjamshidi
Joyce Chai
162
2
0
02 Nov 2023
What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning
International Conference on Computational Linguistics (COLING), 2023
Yifan Du
Hangyu Guo
Kun Zhou
Wayne Xin Zhao
Jinpeng Wang
Chuyuan Wang
Mingchen Cai
Ruihua Song
Ji-Rong Wen
VLM
MLLM
LRM
524
28
0
02 Nov 2023
CapsFusion: Rethinking Image-Text Data at Scale
Computer Vision and Pattern Recognition (CVPR), 2023
Qiying Yu
Quan-Sen Sun
Xiaosong Zhang
Yufeng Cui
Fan Zhang
Yue Cao
Xinlong Wang
Jingjing Liu
VLM
371
88
0
31 Oct 2023
MCAD: Multi-teacher Cross-modal Alignment Distillation for efficient image-text retrieval
Youbo Lei
Feifei He
Chen Chen
Yingbin Mo
Sijia Li
Defeng Xie
H. Lu
VLM
369
2
0
30 Oct 2023
Women Wearing Lipstick: Measuring the Bias Between an Object and Its Related Gender
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Ahmed Sabir
Lluís Padró
353
3
0
29 Oct 2023
CityRefer: Geography-aware 3D Visual Grounding Dataset on City-scale Point Cloud Data
Neural Information Processing Systems (NeurIPS), 2023
Taiki Miyanishi
Fumiya Kitamori
Shuhei Kurita
Jungdae Lee
M. Kawanabe
Nakamasa Inoue
AI4TS
3DPC
227
15
0
28 Oct 2023
GROOViST: A Metric for Grounding Objects in Visual Storytelling
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Aditya K Surikuchi
Sandro Pezzelle
Raquel Fernández
152
14
0
26 Oct 2023
Evaluating Bias and Fairness in Gender-Neutral Pretrained Vision-and-Language Models
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Laura Cabello
Emanuele Bugliarello
Stephanie Brandl
Desmond Elliott
279
8
0
26 Oct 2023
RIO: A Benchmark for Reasoning Intention-Oriented Objects in Open Environments
Neural Information Processing Systems (NeurIPS), 2023
Mengxue Qu
Yu-Huan Wu
Wu Liu
Xiaodan Liang
Jingkuan Song
Yao-Min Zhao
Yunchao Wei
243
19
0
26 Oct 2023
Context Does Matter: End-to-end Panoptic Narrative Grounding with Deformable Attention Refined Matching Network
IEEE International Conference on Data Mining (ICDM), 2023
Yiming Lin
Xiao-Bo Jin
Qiufeng Wang
Kaizhu Huang
159
5
0
25 Oct 2023
Video Referring Expression Comprehension via Transformer with Content-conditioned Query
Jiang Ji
Meng Cao
Tengtao Song
Long Chen
Yi Wang
Yuexian Zou
273
6
0
25 Oct 2023
TiC-CLIP: Continual Training of CLIP Models
International Conference on Learning Representations (ICLR), 2023
Saurabh Garg
Mehrdad Farajtabar
Hadi Pouransari
Raviteja Vemulapalli
Sachin Mehta
Oncel Tuzel
Vaishaal Shankar
Fartash Faghri
VLM
CLIP
361
40
0
24 Oct 2023
Localizing Active Objects from Egocentric Vision with Symbolic World Knowledge
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Te-Lin Wu
Yu Zhou
Nanyun Peng
194
10
0
23 Oct 2023
Open-Set Image Tagging with Multi-Grained Text Supervision
Xinyu Huang
Yi-Jie Huang
Youcai Zhang
Weiwei Tian
Rui Feng
Yuejie Zhang
Yanchun Xie
Yaqian Li
Lei Zhang
VLM
249
64
0
23 Oct 2023
OV-VG: A Benchmark for Open-Vocabulary Visual Grounding
Chunlei Wang
Wenquan Feng
Xiangtai Li
Guangliang Cheng
Shuchang Lyu
Binghao Liu
Lijiang Chen
Qi Zhao
ObjD
VLM
278
14
0
22 Oct 2023
ITEm: Unsupervised Image-Text Embedding Learning for eCommerce
Baohao Liao
Michael Kozielski
Sanjika Hewavitharana
Jiangbo Yuan
Shahram Khadivi
Tomer Lancewicki
SSL
132
0
0
22 Oct 2023
On the Transferability of Visually Grounded PCFGs
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Yanpeng Zhao
Ivan Titov
147
1
0
21 Oct 2023
CAPIVARA: Cost-Efficient Approach for Improving Multilingual CLIP Performance on Low-Resource Languages
G. O. D. Santos
Diego A. B. Moreira
Alef Iury Ferreira
Jhessica Silva
Luiz Pereira
...
H. Maia
Nádia Da Silva
Esther Colombini
Hélio Pedrini
Sandra Avila
VLM
CLIP
193
7
0
20 Oct 2023
Semi-supervised multimodal coreference resolution in image narrations
A. Goel
Basura Fernando
Frank Keller
Hakan Bilen
218
6
0
20 Oct 2023
Multiscale Superpixel Structured Difference Graph Convolutional Network for VL Representation
Siyu Zhang
Ye-Ting Chen
Fang Wang
Yaoru Sun
Jun Yang
Lizhi Bai
SSL
299
1
0
20 Oct 2023
InvGC: Robust Cross-Modal Retrieval by Inverse Graph Convolution
Xiangru Jian
Yimu Wang
255
6
0
20 Oct 2023
On the Language Encoder of Contrastive Cross-modal Models
Mengjie Zhao
Junya Ono
Zhi-Wei Zhong
Chieh-Hsin Lai
Yuhta Takida
Naoki Murata
Wei-Hsiang Liao
Takashi Shibuya
Hiromi Wakaki
Yuki Mitsufuji
VLM
156
2
0
20 Oct 2023
Frozen Transformers in Language Models Are Effective Visual Encoder Layers
Ziqi Pang
Ziyang Xie
Yunze Man
Yu-Xiong Wang
431
49
0
19 Oct 2023
Evaluating the Fairness of Discriminative Foundation Models in Computer Vision
AAAI/ACM Conference on AI, Ethics, and Society (AIES), 2023
Junaid Ali
Matthäus Kleindessner
F. Wenzel
Kailash Budhathoki
Volkan Cevher
Chris Russell
VLM
248
15
0
18 Oct 2023
Balance Act: Mitigating Hubness in Cross-Modal Retrieval with Query and Gallery Banks
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Yimu Wang
Xiangru Jian
Bo Xue
204
22
0
17 Oct 2023
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
Jianwei Yang
Hao Zhang
Feng Li
Xueyan Zou
Chun-yue Li
Jianfeng Gao
MLLM
VLM
447
269
0
17 Oct 2023
NICE: Improving Panoptic Narrative Detection and Segmentation with Cascading Collaborative Learning
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Haowei Wang
Jiayi Ji
Tianyu Guo
Yilong Yang
Weihao Ye
Xiaoshuai Sun
Rongrong Ji
353
8
0
17 Oct 2023
MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning
Jun Chen
Deyao Zhu
Xiaoqian Shen
Xiang Li
Zechun Liu
Pengchuan Zhang
Raghuraman Krishnamoorthi
Vikas Chandra
Yunyang Xiong
Mohamed Elhoseiny
MLLM
1.5K
631
0
14 Oct 2023
From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models
Dongsheng Jiang
Yuchen Liu
Songlin Liu
Jiné Zhao
Hao Zhang
Zhen Gao
Xiaopeng Zhang
Jin Li
Hongkai Xiong
MLLM
VLM
412
72
0
13 Oct 2023
Incremental Object Detection with CLIP
Ziyue Huang
Yupeng He
Qingjie Liu
Yunhong Wang
CLL
ObjD
VLM
299
2
0
13 Oct 2023
Ferret: Refer and Ground Anything Anywhere at Any Granularity
International Conference on Learning Representations (ICLR), 2023
Haoxuan You
Haotian Zhang
Zhe Gan
Xianzhi Du
Bowen Zhang
Zirui Wang
Liangliang Cao
Shih-Fu Chang
Yinfei Yang
ObjD
MLLM
VLM
421
455
0
11 Oct 2023
VeCLIP: Improving CLIP Training via Visual-enriched Captions
European Conference on Computer Vision (ECCV), 2023
Zhengfeng Lai
Haotian Zhang
Bowen Zhang
Wentao Wu
Haoping Bai
...
Zhe Gan
Jiulong Shan
Chen-Nee Chuah
Yinfei Yang
Meng Cao
CLIP
VLM
365
60
0
11 Oct 2023
TextPSG: Panoptic Scene Graph Generation from Textual Descriptions
IEEE International Conference on Computer Vision (ICCV), 2023
Chengyang Zhao
Songlin Yang
Zhenfang Chen
Mingyu Ding
Chuang Gan
393
23
0
10 Oct 2023
InstructDET: Diversifying Referring Object Detection with Generalized Instructions
International Conference on Learning Representations (ICLR), 2023
Ronghao Dang
Jiangyan Feng
Haodong Zhang
Chongjian Ge
Lin Song
...
Chengju Liu
Qi Chen
Feng Zhu
Rui Zhao
Yibing Song
ObjD
441
16
0
08 Oct 2023
Lightweight In-Context Tuning for Multimodal Unified Models
Yixin Chen
Shuai Zhang
Boran Han
Jiaya Jia
144
5
0
08 Oct 2023
Envisioning Narrative Intelligence: A Creative Visual Storytelling Anthology
International Conference on Human Factors in Computing Systems (CHI), 2023
Brett A. Halperin
S. Lukin
CoGe
214
30
0
06 Oct 2023
ECoFLaP: Efficient Coarse-to-Fine Layer-Wise Pruning for Vision-Language Models
International Conference on Learning Representations (ICLR), 2023
Yi-Lin Sung
Jaehong Yoon
Mohit Bansal
VLM
282
20
0
04 Oct 2023
CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction
International Conference on Learning Representations (ICLR), 2023
Size Wu
Wenwei Zhang
Lumin Xu
Sheng Jin
Xiangtai Li
Wentao Liu
Chen Change Loy
CLIP
VLM
250
104
0
02 Oct 2023
Towards reporting bias in visual-language datasets: bimodal augmentation by decoupling object-attribute association
Qiyu Wu
Mengjie Zhao
Yutong He
Lang Huang
Junya Ono
Hiromi Wakaki
Yuki Mitsufuji
298
6
0
02 Oct 2023
Understanding Transferable Representation Learning and Zero-shot Transfer in CLIP
International Conference on Learning Representations (ICLR), 2023
Zixiang Chen
Yihe Deng
Yuanzhi Li
Quanquan Gu
VLM
395
18
0
02 Oct 2023
Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs
Computer Vision and Pattern Recognition (CVPR), 2023
Shiyu Xuan
Qingpei Guo
Ming Yang
Shiliang Zhang
MLLM
ObjD
270
52
0
01 Oct 2023
Black-box Attacks on Image Activity Prediction and its Natural Language Explanations
Alina Elena Baia
Valentina Poggioni
Andrea Cavallaro
AAML
227
1
0
30 Sep 2023
Region-centric Image-Language Pretraining for Open-Vocabulary Detection
European Conference on Computer Vision (ECCV), 2023
Dahun Kim
A. Angelova
Weicheng Kuo
ObjD
VLM
257
6
0
29 Sep 2023
Retail-786k: a Large-Scale Dataset for Visual Entity Matching
Bianca Lamm
Janis Keuper
VLM
243
4
0
29 Sep 2023
A Survey on Image-text Multimodal Models
Ruifeng Guo
Jingxuan Wei
Linzhuang Sun
Khai-Nguyen Nguyen
Guiyong Chang
Dawei Liu
Sibo Zhang
Zhengbing Yao
Mingjun Xu
Liping Bu
VLM
328
22
0
23 Sep 2023
TinyCLIP: CLIP Distillation via Affinity Mimicking and Weight Inheritance
IEEE International Conference on Computer Vision (ICCV), 2023
Kan Wu
Houwen Peng
Zhenghong Zhou
Bin Xiao
Xiyang Dai
...
Xi Chen
Xinggang Wang
Hongyang Chao
Han Hu
VLM
OODD
257
97
0
21 Sep 2023
Multi3DRefer: Grounding Text Description to Multiple 3D Objects
IEEE International Conference on Computer Vision (ICCV), 2023
Yiming Zhang
ZeMing Gong
Angel X. Chang
397
134
0
11 Sep 2023
Prefix-diffusion: A Lightweight Diffusion Model for Diverse Image Captioning
International Conference on Language Resources and Evaluation (LREC), 2023
Guisheng Liu
Yi Li
Zhengcong Fei
Haiyan Fu
Xiangyang Luo
Yanqing Guo
VLM
DiffM
265
16
0
10 Sep 2023
Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization
International Conference on Learning Representations (ICLR), 2023
Yang Jin
Kun Xu
Kun Xu
Liwei Chen
Chao Liao
...
Xiaoqiang Lei
Chen Zhang
Wenwu Ou
Kun Gai
Yadong Mu
MLLM
VLM
257
77
0
09 Sep 2023
Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding
European Conference on Computer Vision (ECCV), 2023
Ozan Unal
Daniel Gehrig
Suman Saha
Luc Van Gool
288
29
0
08 Sep 2023