2103.01913
WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning
2 March 2021
Krishna Srinivasan
K. Raman
Jiecao Chen
Michael Bendersky
Marc Najork
VLM
Papers citing "WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning"
7 / 7 papers shown
Title
Platonic Grounding for Efficient Multimodal Language Models
Moulik Choraria
Xinbo Wu
Akhil Bhimaraju
Nitesh Sekhar
Yue Wu
Xu Zhang
Prateek Singhal
L. Varshney
44
52
0
27 Apr 2025
TextTIGER: Text-based Intelligent Generation with Entity Prompt Refinement for Text-to-Image Generation
Shintaro Ozaki
Kazuki Hayashi
Yusuke Sakai
Jingun Kwon
Hidetaka Kamigaito
Katsuhiko Hayashi
Manabu Okumura
Taro Watanabe
VLM
47
0
0
25 Apr 2025
Vision-Language Model Selection and Reuse for Downstream Adaptation
Hao-Zhe Tan
Zhi-Hua Zhou
Lan-Zhe Guo
Yu-Feng Li
VLM
47
0
0
30 Jan 2025
VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling
Zeyue Tian
Zhaoyang Liu
Ruibin Yuan
Jiahao Pan
Xiaoqiang Huang
Xu Tan
Qifeng Chen
Y. Guo
VGen
53
2
0
06 Jun 2024
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
Soravit Changpinyo
P. Sharma
Nan Ding
Radu Soricut
VLM
254
845
0
17 Feb 2021
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
Chao Jia
Yinfei Yang
Ye Xia
Yi-Ting Chen
Zarana Parekh
Hieu H. Pham
Quoc V. Le
Yun-hsuan Sung
Zhen Li
Tom Duerig
VLM
CLIP
268
2,875
0
11 Feb 2021
Unified Vision-Language Pre-Training for Image Captioning and VQA
Luowei Zhou
Hamid Palangi
Lei Zhang
Houdong Hu
Jason J. Corso
Jianfeng Gao
MLLM
VLM
231
815
0
24 Sep 2019