Scalable and Accurate Self-supervised Multimodal Representation Learning
without Aligned Video and Text Data

Scalable and Accurate Self-supervised Multimodal Representation Learning without Aligned Video and Text Data

4 April 2023

Vladislav Lialin

Papers citing "Scalable and Accurate Self-supervised Multimodal Representation Learning without Aligned Video and Text Data"

9 / 9 papers shown

Title
Multi-Stage Multi-Modal Pre-Training for Automatic Speech Recognition Yash Jain David M. Chan Pranav Dheram Aparna Khare Olabanji Shonibare Venkatesh Ravichandran Shalini Ghosh 32 2 0 28 Mar 2024
Distribution Aware Metrics for Conditional Natural Language Generation David M. Chan Yiming Ni David A. Ross Sudheendra Vijayanarasimhan Austin Myers John F. Canny 35 4 0 15 Sep 2022
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation Junnan Li Dongxu Li Caiming Xiong S. Hoi MLLM BDL VLM CLIP 390 4,110 0 28 Jan 2022
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts Soravit Changpinyo P. Sharma Nan Ding Radu Soricut VLM 273 1,077 0 17 Feb 2021
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision Chao Jia Yinfei Yang Ye Xia Yi-Ting Chen Zarana Parekh Hieu H. Pham Quoc V. Le Yun-hsuan Sung Zhen Li Tom Duerig VLM CLIP 293 3,683 0 11 Feb 2021
Is Space-Time Attention All You Need for Video Understanding? Gedas Bertasius Heng Wang Lorenzo Torresani ViT 278 1,978 0 09 Feb 2021
Unifying Vision-and-Language Tasks via Text Generation Jaemin Cho Jie Lei Hao Tan Mohit Bansal MLLM 249 525 0 04 Feb 2021
Scaling Laws for Neural Language Models Jared Kaplan Sam McCandlish T. Henighan Tom B. Brown B. Chess R. Child Scott Gray Alec Radford Jeff Wu Dario Amodei 226 4,424 0 23 Jan 2020
Xception: Deep Learning with Depthwise Separable Convolutions François Chollet MDE BDL PINN 201 14,304 0 07 Oct 2016