1505.04870
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models
19 May 2015
Bryan A. Plummer
Liwei Wang
Christopher M. Cervantes
Juan C. Caicedo
Anjali Narayan-Chen
Svetlana Lazebnik
Papers citing
"Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models"
50 / 1,325 papers shown
GlitchBench: Can large multimodal models detect video game glitches?
Computer Vision and Pattern Recognition (CVPR), 2023
Mohammad Reza Taesiri
Tianjun Feng
Anh Totti Nguyen
Cor-Paul Bezemer
MLLM
VLM
LRM
361
19
0
08 Dec 2023
SA-Attack: Improving Adversarial Transferability of Vision-Language Pre-training Models via Self-Augmentation
Bangyan He
Yang Liu
Yaning Tan
Tianrui Lou
Yang Liu
Simeng Qin
AAML
VLM
342
34
0
08 Dec 2023
Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects
Junyu Lu
Ruyi Gan
Di Zhang
Xiaojun Wu
Ziwei Wu
Renliang Sun
Jiaxing Zhang
Pingjian Zhang
Yan Song
MLLM
VLM
232
22
0
08 Dec 2023
Localized Symbolic Knowledge Distillation for Visual Commonsense Models
Neural Information Processing Systems (NeurIPS), 2023
Jinho Park
Jack Hessel
Khyathi Chandu
Paul Pu Liang
Ximing Lu
...
Youngjae Yu
Qiuyuan Huang
Jianfeng Gao
Ali Farhadi
Yejin Choi
VLM
272
13
0
08 Dec 2023
Improved Visual Grounding through Self-Consistent Explanations
Ruozhen He
Paola Cascante-Bonilla
Ziyan Yang
Alexander C. Berg
Vicente Ordonez
ReLM
ObjD
LRM
FAtt
280
24
0
07 Dec 2023
OT-Attack: Enhancing Adversarial Transferability of Vision-Language Models via Optimal Transport Optimization
Dongchen Han
Yang Liu
Yang Bai
Jindong Gu
Yang Liu
Simeng Qin
VLM
285
32
0
07 Dec 2023
Mitigating Open-Vocabulary Caption Hallucinations
Assaf Ben-Kish
Moran Yanuka
Morris Alper
Raja Giryes
Hadar Averbuch-Elor
MLLM
VLM
399
14
0
06 Dec 2023
TokenCompose: Text-to-Image Diffusion with Token-level Supervision
Zirui Wang
Zhizhou Sha
Zheng Ding
Yilin Wang
Zhuowen Tu
DiffM
284
14
0
06 Dec 2023
Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment
European Conference on Computer Vision (ECCV), 2023
Brian Gordon
Yonatan Bitton
Yonatan Shafir
Roopal Garg
Xi Chen
Dani Lischinski
Daniel Cohen-Or
Idan Szpektor
242
17
0
05 Dec 2023
LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models
Hao Zhang
Hongyang Li
Feng Li
Tianhe Ren
Xueyan Zou
...
Shijia Huang
Jianfeng Gao
Lei Zhang
Chun-yue Li
Jianwei Yang
352
112
0
05 Dec 2023
Aligning and Prompting Everything All at Once for Universal Visual Perception
Computer Vision and Pattern Recognition (CVPR), 2023
Chunjiang Ge
Chaoyou Fu
Peixian Chen
Mengdan Zhang
Ke Li
Xing Sun
Yunsheng Wu
Shaohui Lin
Rongrong Ji
VLM
ObjD
290
66
0
04 Dec 2023
Good Questions Help Zero-Shot Image Reasoning
Kaiwen Yang
Tao Shen
Xinmei Tian
Xiubo Geng
Chongyang Tao
Dacheng Tao
Wanrong Zhu
LRM
274
10
0
04 Dec 2023
ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
Computer Vision and Pattern Recognition (CVPR), 2023
Mu Cai
Haotian Liu
Dennis Park
Siva Karthik Mustikovela
Gregory P. Meyer
Yuning Chai
Yong Jae Lee
VLM
LRM
MLLM
332
153
0
01 Dec 2023
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
Computer Vision and Pattern Recognition (CVPR), 2023
M. Steyvers
Yuan Yao
Haoye Zhang
Taiwen He
Yifeng Han
...
Xinyue Hu
Zhiyuan Liu
Hai-Tao Zheng
Maosong Sun
Tat-Seng Chua
MLLM
VLM
447
345
0
01 Dec 2023
MLLMs-Augmented Visual-Language Representation Learning
Yanqing Liu
Kai Wang
Wenqi Shao
Ping Luo
Yu Qiao
Mike Zheng Shou
Kaipeng Zhang
Yang You
VLM
263
19
0
30 Nov 2023
The devil is in the fine-grained details: Evaluating open-vocabulary object detectors for fine-grained understanding
Computer Vision and Pattern Recognition (CVPR), 2023
Lorenzo Bianchi
F. Carrara
Nicola Messina
Claudio Gennaro
Fabrizio Falchi
ObjD
368
26
0
29 Nov 2023
UniIR: Training and Benchmarking Universal Multimodal Information Retrievers
Cong Wei
Yang Chen
Haonan Chen
Hexiang Hu
Ge Zhang
Jie Fu
Alan Ritter
Lei Ma
269
128
0
28 Nov 2023
Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions
Computer Vision and Pattern Recognition (CVPR), 2023
Zeyu Han
Fangrui Zhu
Qianru Lao
Huaizu Jiang
ObjD
423
20
0
28 Nov 2023
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
Computer Vision and Pattern Recognition (CVPR), 2023
Kunchang Li
Yali Wang
Yinan He
Yizhuo Li
Yi Wang
...
Jilan Xu
Guo Chen
Ping Luo
Limin Wang
Yu Qiao
VLM
MLLM
673
861
0
28 Nov 2023
Reason out Your Layout: Evoking the Layout Master from Large Language Models for Text-to-Image Synthesis
Xiaohui Chen
Yongfei Liu
Yingxiang Yang
Jianbo Yuan
Quanzeng You
Liping Liu
Hongxia Yang
DiffM
206
15
0
28 Nov 2023
IG Captioner: Information Gain Captioners are Strong Zero-shot Classifiers
European Conference on Computer Vision (ECCV), 2023
Chenglin Yang
Siyuan Qiao
Yuan Cao
Yu Zhang
Tao Zhu
Yaoyao Liu
Jiahui Yu
VLM
163
3
0
27 Nov 2023
EVCap: Retrieval-Augmented Image Captioning with External Visual-Name Memory for Open-World Comprehension
Computer Vision and Pattern Recognition (CVPR), 2023
Jiaxuan Li
D. Vo
Akihiro Sugimoto
Hideki Nakayama
KELM
VLM
285
45
0
27 Nov 2023
EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models
Computer Vision and Pattern Recognition (CVPR), 2023
Sijie Cheng
Zhicheng Guo
Jingwen Wu
Kechen Fang
Peng Li
Huaping Liu
Yang Liu
EgoV
LRM
265
48
0
27 Nov 2023
Vision Enhancing LLMs: Empowering Multimodal Knowledge Storage and Sharing in LLMs
Yunxin Li
Zhenyu Liu
Wei Wang
Xiaochun Cao
Yuxin Ding
Xiaochun Cao
Min Zhang
185
6
0
27 Nov 2023
Griffon: Spelling out All Object Locations at Any Granularity with Large Language Models
European Conference on Computer Vision (ECCV), 2023
Yufei Zhan
Yousong Zhu
Zhiyang Chen
Fan Yang
E. Goles
Jinqiao Wang
ObjD
242
30
0
24 Nov 2023
Invisible Relevance Bias: Text-Image Retrieval Models Prefer AI-Generated Images
Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2023
Shicheng Xu
Danyang Hou
Liang Pang
Jingcheng Deng
Jun Xu
Huawei Shen
Xueqi Cheng
261
20
0
23 Nov 2023
From Wrong To Right: A Recursive Approach Towards Vision-Language Explanation
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Jiaxin Ge
Sanjay Subramanian
Trevor Darrell
Boyi Li
LRM
252
4
0
21 Nov 2023
What's left can't be right -- The remaining positional incompetence of contrastive vision-language models
Nils Hoehing
Ellen Rushe
Anthony Ventresque
VLM
205
4
0
20 Nov 2023
Expanding Scene Graph Boundaries: Fully Open-vocabulary Scene Graph Generation via Visual-Concept Alignment and Retention
Zuyao Chen
Jinlin Wu
Zhen Lei
Zhaoxiang Zhang
Changwen Chen
302
29
0
18 Nov 2023
The Impact of Familiarity on Naming Variation: A Study on Object Naming in Mandarin Chinese
Yunke He
Xixian Liao
Jialing Liang
Gemma Boleda
170
0
0
16 Nov 2023
Trustworthy Large Models in Vision: A Survey
Ziyan Guo
Kepeng Xu
Jun Liu
MU
657
0
0
16 Nov 2023
Violet: A Vision-Language Model for Arabic Image Captioning with Gemini Decoder
Abdelrahman Mohamed
Fakhraddin Alwajih
El Moatez Billah Nagoudi
Alcides Alcoba Inciarte
Muhammad Abdul-Mageed
VLM
MLLM
169
13
0
15 Nov 2023
Towards Open-Ended Visual Recognition with Large Language Model
Qihang Yu
Xiaohui Shen
Liang-Chieh Chen
VLM
258
8
0
14 Nov 2023
SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models
Ziyi Lin
Chris Liu
Renrui Zhang
Shiyang Feng
Longtian Qiu
...
Siyuan Huang
Yichi Zhang
Xuming He
Jiaming Song
Yu Qiao
MLLM
VLM
378
275
0
13 Nov 2023
GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation
An Yan
Zhengyuan Yang
Wanrong Zhu
Kevin Qinghong Lin
Linjie Li
...
Yiwu Zhong
Julian McAuley
Jianfeng Gao
Zicheng Liu
Lijuan Wang
LLMAG
LM&Ro
396
145
0
13 Nov 2023
ViLMA: A Zero-Shot Benchmark for Linguistic and Temporal Grounding in Video-Language Models
International Conference on Learning Representations (ICLR), 2023
İlker Kesen
Andrea Pedrotti
Mustafa Dogan
Michele Cafagna
Emre Can Acikgoz
...
Iacer Calixto
Anette Frank
Albert Gatt
Aykut Erdem
Erkut Erdem
276
21
0
13 Nov 2023
Which One? Leveraging Context Between Objects and Multiple Views for Language Grounding
North American Chapter of the Association for Computational Linguistics (NAACL), 2023
Chancharik Mitra
Abrar Anwar
Rodolfo Corona
Dan Klein
Trevor Darrell
Jesse Thomason
212
2
0
12 Nov 2023
PerceptionGPT: Effectively Fusing Visual Perception into LLM
Computer Vision and Pattern Recognition (CVPR), 2023
Renjie Pi
Lewei Yao
Jiahui Gao
Jipeng Zhang
Tong Zhang
MLLM
200
59
0
11 Nov 2023
GOAT: GO to Any Thing
Matthew Chang
Théophile Gervet
Mukul Khanna
Sriram Yenamandra
Dhruv Shah
...
Saurabh Gupta
Dhruv Batra
Roozbeh Mottaghi
Jitendra Malik
Devendra Singh Chaplot
366
114
0
10 Nov 2023
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
Computer Vision and Pattern Recognition (CVPR), 2023
Bin Xiao
Haiping Wu
Weijian Xu
Xiyang Dai
Houdong Hu
Yumao Lu
Michael Zeng
Ce Liu
Lu Yuan
VLM
404
393
0
10 Nov 2023
Watermarking Vision-Language Pre-trained Models for Multi-modal Embedding as a Service
Yuanmin Tang
Jing Yu
Keke Gai
Xiangyang Qu
Yue Hu
Gang Xiong
Qi Wu
AAML
WaLM
VLM
214
11
0
10 Nov 2023
Language-guided Robot Grasping: CLIP-based Referring Grasp Synthesis in Clutter
Conference on Robot Learning (CoRL), 2023
Georgios Tziafas
Yucheng Xu
Arushi Goel
Mohammadreza Kasaei
Zhibin Li
Hamidreza Kasaei
243
41
0
09 Nov 2023
Active Mining Sample Pair Semantics for Image-text Matching
Yongfeng Chen
Jin Liu
Zhijing Yang
Ruihan Chen
Junpeng Tan
VLM
210
0
0
09 Nov 2023
u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model
Jinjin Xu
Liwu Xu
Yuzhe Yang
Xiang Li
Fanyi Wang
Yanchun Xie
Yi-Jie Huang
Yaqian Li
MoE
MLLM
VLM
444
24
0
09 Nov 2023
NExT-Chat: An LMM for Chat, Detection and Segmentation
Ao Zhang
Yuan Yao
Wei Ji
Zhiyuan Liu
Tat-Seng Chua
MLLM
VLM
374
77
0
08 Nov 2023
Scene-Driven Multimodal Knowledge Graph Construction for Embodied AI
IEEE Transactions on Knowledge and Data Engineering (TKDE), 2023
Yaoxian Song
Yixiang Chen
Haoyu Liu
Li Zhixu
Wei Song
Yanghua Xiao
Xiaofang Zhou
LM&Ro
227
34
0
07 Nov 2023
GLaMM: Pixel Grounding Large Multimodal Model
Computer Vision and Pattern Recognition (CVPR), 2023
H. Rasheed
Muhammad Maaz
Sahal Shaji Mullappilly
Abdelrahman M. Shaker
Salman Khan
Hisham Cholakkal
Rao M. Anwer
Eric Xing
Ming-Hsuan Yang
Fahad S. Khan
MLLM
VLM
434
405
0
06 Nov 2023
CogVLM: Visual Expert for Pretrained Language Models
Neural Information Processing Systems (NeurIPS), 2023
Weihan Wang
Qingsong Lv
Wenmeng Yu
Wenyi Hong
Ji Qi
...
Bin Xu
Juanzi Li
Yuxiao Dong
Ming Ding
Jie Tang
VLM
MLLM
720
720
0
06 Nov 2023
Augment the Pairs: Semantics-Preserving Image-Caption Pair Augmentation for Grounding-Based Vision and Language Models
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2023
Jingru Yi
Burak Uzkent
Oana Ignat
Zili Li
Amanmeet Garg
Xiang Yu
Linda Liu
VLM
283
2
0
05 Nov 2023
A New Fine-grained Alignment Method for Image-text Matching
Yang Zhang
167
1
0
03 Nov 2023