1504.00325
Cited By
v1
v2 (latest)
Microsoft COCO Captions: Data Collection and Evaluation Server
1 April 2015
Xinlei Chen
Hao Fang
Nayeon Lee
Ramakrishna Vedantam
Saurabh Gupta
Piotr Dollar
C. L. Zitnick
Papers citing
"Microsoft COCO Captions: Data Collection and Evaluation Server"
50 / 1,519 papers shown
MedImageInsight: An Open-Source Embedding Model for General Domain Medical Imaging
Noel C. F. Codella
Ying Jin
Shrey Jain
Yu Gu
Ho Hin Lee
...
Lei Li
Thomas Lin
Ivan Tarapov
M. Lungren
Mu-Hsin Wei
LM&MA
VLM
MedIm
315
32
0
09 Oct 2024
M^3EL: A Multi-task Multi-topic Dataset for Multi-modal Entity Linking
Fang Wang
Shenglin Yin
Xiaoying Bai
Minghao Hu
Tianwei Yan
Yi Liang
VLM
247
1
0
08 Oct 2024
SIA-OVD: Shape-Invariant Adapter for Bridging the Image-Region Gap in Open-Vocabulary Detection
ACM Multimedia (MM), 2024
Zishuo Wang
Wenhao Zhou
Jinglin Xu
Yuxin Peng
ObjD
VLM
208
7
0
08 Oct 2024
Precise Model Benchmarking with Only a Few Observations
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Riccardo Fogliato
Pratik Patil
Nil-Jana Akpinar
Mathew Monfort
209
1
0
07 Oct 2024
Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Youngtaek Oh
Jae-Won Cho
Dong-Jin Kim
In So Kweon
Junmo Kim
VLM
CoGe
CLIP
343
11
0
07 Oct 2024
MM-R^3: On (In-)Consistency of Vision-Language Models (VLMs)
Shih-Han Chou
Shivam Chandhok
James J. Little
Leonid Sigal
289
0
0
07 Oct 2024
VEDIT: Latent Prediction Architecture For Procedural Video Representation Learning
International Conference on Learning Representations (ICLR), 2024
Han Lin
Tushar Nagarajan
Nicolas Ballas
Mido Assran
Mojtaba Komeili
Joey Tianyi Zhou
Koustuv Sinha
AI4TS
300
7
0
04 Oct 2024
Bridging the Gap between Text, Audio, Image, and Any Sequence: A Novel Approach using Gloss-based Annotation
Sen Fang
Sizhou Chen
Yalin Feng
Xiaofeng Zhang
T. Teoh
171
0
0
04 Oct 2024
Toward a Holistic Evaluation of Robustness in CLIP Models
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024
Weijie Tu
Weijian Deng
Tom Gedeon
VLM
349
7
0
02 Oct 2024
ASCIIEval: Benchmarking Models' Visual Perception in Text Strings via ASCII Art
Qi Jia
Xiang Yue
Shanshan Huang
Ziheng Qin
Yizhu Liu
Bill Yuchen Lin
Yang You
Guangtao Zhai
VLM
247
2
0
02 Oct 2024
MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning
Haotian Zhang
Mingfei Gao
Zhe Gan
Philipp Dufter
Nina Wenzel
...
Haoxuan You
Zirui Wang
Afshin Dehghan
Peter Grasch
Yinfei Yang
VLM
MLLM
303
66
1
30 Sep 2024
Multimodal LLM Enhanced Cross-lingual Cross-modal Retrieval
ACM Multimedia (MM), 2024
Yabing Wang
Le Wang
Qiang-feng Zhou
Zhibin Wang
Hao Li
Gang Hua
Wei Tang
222
21
0
30 Sep 2024
Efficient Backdoor Defense in Multimodal Contrastive Learning: A Token-Level Unlearning Method for Mitigating Threats
Kuanrong Liu
Yaning Tan
Jiawei Liang
Pengwen Dai
Xiaochun Cao
MU
AAML
273
3
0
29 Sep 2024
From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding
Heqing Zou
Tianze Luo
Guiyang Xie
Victor Zhang
...
Guangcong Wang
Juanyang Chen
Zhuochen Wang
Hansheng Zhang
Huaijian Zhang
VLM
299
19
0
27 Sep 2024
Emu3: Next-Token Prediction is All You Need
Xinlong Wang
Xiaosong Zhang
Zhengxiong Luo
Quan-Sen Sun
Yufeng Cui
...
Xi Yang
Jingjing Liu
Yonghua Lin
Tiejun Huang
Zhongyuan Wang
MLLM
290
483
0
27 Sep 2024
IFCap: Image-like Retrieval and Frequency-based Entity Filtering for Zero-shot Captioning
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Soeun Lee
Si-Woo Kim
Taewhan Kim
Dong-Jin Kim
CLIP
VLM
217
6
0
26 Sep 2024
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
Computer Vision and Pattern Recognition (CVPR), 2024
Matt Deitke
Christopher Clark
Sangho Lee
Rohun Tripathi
Yue Yang
...
Noah A. Smith
Hannaneh Hajishirzi
Ross Girshick
Ali Farhadi
Aniruddha Kembhavi
OSLM
VLM
457
58
0
25 Sep 2024
Understanding the Cognitive Complexity in Language Elicited by Product Images
Yan-Ying Chen
Shabnam Hakimi
Monica P Van
Francine Chen
Matthew K. Hong
M. Klenk
Charlene C. Wu
255
1
0
25 Sep 2024
Enhancing Advanced Visual Reasoning Ability of Large Language Models
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Zhiyuan Li
Dongnan Liu
Chaoyi Zhang
Heng Wang
Tengfei Xue
Weidong Cai
VLM
LRM
259
17
0
21 Sep 2024
Instruction-guided Multi-Granularity Segmentation and Captioning with Large Multimodal Model
Li Zhou
Xu Yuan
Zenghui Sun
Zikun Zhou
Jingsong Lan
VLM
MLLM
861
7
0
20 Sep 2024
JourneyBench: A Challenging One-Stop Vision-Language Understanding Benchmark of Generated Images
Neural Information Processing Systems (NeurIPS), 2024
Zhecan Wang
Junzhang Liu
Chia-Wei Tang
Hani Alomari
Anushka Sivakumar
...
Haoxuan You
A. Ishmam
Kai-Wei Chang
Shih-Fu Chang
Chris Thomas
CoGe
VLM
505
5
0
19 Sep 2024
OneEncoder: A Lightweight Framework for Progressive Alignment of Modalities
Hanane Azzag
M. Lebbah
ObjD
349
2
0
17 Sep 2024
Benchmarking VLMs' Reasoning About Persuasive Atypical Images
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2024
Sina Malakouti
Aysan Aghazadeh
Ashmit Khandelwal
Adriana Kovashka
VLM
378
4
0
16 Sep 2024
Evaluating authenticity and quality of image captions via sentiment and semantic analyses
Aleksei Krotov
Alison Tebo
Dylan K. Picart
Aaron Dean Algave
128
1
0
14 Sep 2024
Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types
Neelabh Sinha
Vinija Jain
Vasu Sharma
187
13
0
14 Sep 2024
Alignment of Diffusion Models: Fundamentals, Challenges, and Future
Buhua Liu
Shitong Shao
Bao Li
Lichen Bai
Zhiqiang Xu
Haoyi Xiong
James Kwok
Sumi Helal
Bo Han
463
22
0
11 Sep 2024
FrozenSeg: Harmonizing Frozen Foundation Models for Open-Vocabulary Segmentation
Xi Chen
Haosen Yang
Sheng Jin
Xiatian Zhu
Huanjin Yao
VLM
244
6
0
05 Sep 2024
A New People-Object Interaction Dataset and NVS Benchmarks
IEEE International Conference on Image Processing (ICIP), 2024
Shuai Guo
Houqiang Zhong
Qi Wang
Ziyu Chen
Yijie Gao
Jiajing Yuan
Chenyu Zhang
Rong Xie
Li Song
268
1
0
03 Sep 2024
Blocks as Probes: Dissecting Categorization Ability of Large Multimodal Models
British Machine Vision Conference (BMVC), 2024
Bin Fu
Qiyang Wan
Jialin Li
Ruiping Wang
Xilin Chen
150
1
0
03 Sep 2024
Expanding on EnCLAP with Auxiliary Retrieval Model for Automated Audio Captioning
Jaeyeon Kim
Jaeyoon Jung
Minjeong Jeon
Sang Hoon Woo
Jinjoo Lee
172
1
0
02 Sep 2024
Pre-Training Multimodal Hallucination Detectors with Corrupted Grounding Data
Spencer Whitehead
Jacob Phillips
Sean Hendryx
183
0
0
30 Aug 2024
Image-Perfect Imperfections: Safety, Bias, and Authenticity in the Shadow of Text-To-Image Model Evolution
Conference on Computer and Communications Security (CCS), 2024
Yixin Wu
Yun Shen
Michael Backes
Yang Zhang
263
7
0
30 Aug 2024
A Survey on Evaluation of Multimodal Large Language Models
Jiaxing Huang
Jingyi Zhang
LM&MA
ELM
LRM
305
42
0
28 Aug 2024
Probing the Robustness of Vision-Language Pretrained Models: A Multimodal Adversarial Attack Approach
Jiwei Guan
Tianyu Ding
Longbing Cao
Lei Pan
Chen Wang
Xi Zheng
AAML
287
3
0
24 Aug 2024
ParGo: Bridging Vision-Language with Partial and Global Views
AAAI Conference on Artificial Intelligence (AAAI), 2024
An-Lan Wang
Bin Shan
Wei Shi
Kun-Yu Lin
Xiang Fei
Guozhi Tang
Lei Liao
Jingqun Tang
Can Huang
Wei-Shi Zheng
MLLM
VLM
519
23
0
23 Aug 2024
SEA: Supervised Embedding Alignment for Token-Level Visual-Textual Integration in MLLMs
Yuanyang Yin
Yaqi Zhao
Yajie Zhang
Yuanxing Zhang
Ke Lin
Jiahao Wang
Pengfei Wan
Di Zhang
Baoqun Yin
Wentao Zhang
LRM
332
11
0
21 Aug 2024
Attribution Analysis Meets Model Editing: Advancing Knowledge Correction in Vision Language Models with VisEdit
AAAI Conference on Artificial Intelligence (AAAI), 2024
Qizhou Chen
Taolin Zhang
Chengyu Wang
Xiaofeng He
Dakan Wang
Tingting Liu
KELM
695
5
0
19 Aug 2024
Quality Assessment in the Era of Large Models: A Survey
Zicheng Zhang
Yingjie Zhou
Chunyi Li
Baixuan Zhao
Xiaohong Liu
Guangtao Zhai
344
33
0
17 Aug 2024
Can Large Language Models Understand Symbolic Graphics Programs?
International Conference on Learning Representations (ICLR), 2024
Zeju Qiu
Weiyang Liu
Haiwen Feng
Zhen Liu
Tim Z. Xiao
Katherine M. Collins
J. Tenenbaum
Adrian Weller
Michael J. Black
Bernhard Schölkopf
602
28
0
15 Aug 2024
Efficient and Versatile Robust Fine-Tuning of Zero-shot Models
European Conference on Computer Vision (ECCV), 2024
Sungyeon Kim
Boseung Jeong
Donghyun Kim
Suha Kwak
VLM
232
9
0
11 Aug 2024
ArtVLM: Attribute Recognition Through Vision-Based Prefix Language Modeling
European Conference on Computer Vision (ECCV), 2024
William Y. Zhu
Keren Ye
Junjie Ke
Jiahui Yu
Leonidas Guibas
P. Milanfar
Feng Yang
341
2
0
07 Aug 2024
Attacks and Defenses for Generative Diffusion Models: A Comprehensive Survey
ACM Computing Surveys (ACM CSUR), 2024
V. T. Truong
Luan Ba Dang
Long Bao Le
DiffM
MedIm
341
45
0
06 Aug 2024
GazeXplain: Learning to Predict Natural Language Explanations of Visual Scanpaths
European Conference on Computer Vision (ECCV), 2024
Xianyu Chen
Ming Jiang
Qi Zhao
213
8
0
05 Aug 2024
VisionUnite: A Vision-Language Foundation Model for Ophthalmology Enhanced with Clinical Knowledge
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024
Zihan Li
Diping Song
Zefeng Yang
Deming Wang
Fei Li
Xiulan Zhang
P. E. Kinahan
Yu Qiao
VLM
LM&MA
329
20
0
05 Aug 2024
VolDoGer: LLM-assisted Datasets for Domain Generalization in Vision-Language Tasks
Juhwan Choi
Junehyoung Kwon
Jungmin Yun
Seunguk Yu
Youngbin Kim
309
3
0
29 Jul 2024
Start from Video-Music Retrieval: An Inter-Intra Modal Loss for Cross Modal Retrieval
Zeyu Chen
Pengfei Zhang
Kai Ye
Wei Dong
Xin Feng
Yana Zhang
225
1
0
28 Jul 2024
LLAVADI: What Matters For Multimodal Large Language Models Distillation
Shilin Xu
Xiangtai Li
Haobo Yuan
Lu Qi
Yunhai Tong
Ming-Hsuan Yang
216
15
0
28 Jul 2024
SWIFT: Semantic Watermarking for Image Forgery Thwarting
Gautier Evennou
Vivien Chappelier
Ewa Kijak
Teddy Furon
245
6
0
26 Jul 2024
MLLM-CompBench: A Comparative Reasoning Benchmark for Multimodal LLMs
Jihyung Kil
Zheda Mai
Justin Lee
Zihe Wang
Kerrie Cheng
Jingyan Bai
Ye Liu
A. Chowdhury
Wei-Lun Chao
CoGe
VLM
345
1
0
23 Jul 2024
Multimodal Unlearnable Examples: Protecting Data against Multimodal Contrastive Learning
Xinwei Liu
Yang Liu
Yuan Xun
Yaning Tan
Simeng Qin
283
13
0
23 Jul 2024