CoCa: Contrastive Captioners are Image-Text Foundation Models

4 May 2022
Jiahui Yu
Zirui Wang
Vijay Vasudevan
Legg Yeung
Mojtaba Seyedhosseini
Yonghui Wu
VLM, CLIP, OffRL
arXiv:2205.01917 (abs) · PDF · HTML · HuggingFace (3 upvotes)

Papers citing "CoCa: Contrastive Captioners are Image-Text Foundation Models"

41 / 1,041 papers shown
Patching open-vocabulary models by interpolating weights
Neural Information Processing Systems (NeurIPS), 2022
Gabriel Ilharco
Mitchell Wortsman
S. Gadre
Shuran Song
Hannaneh Hajishirzi
Simon Kornblith
Ali Farhadi
Ludwig Schmidt
VLM, KELM
323
201
0
10 Aug 2022
Self-supervised Multi-modal Training from Uncurated Image and Reports Enables Zero-shot Oversight Artificial Intelligence in Radiology
Sangjoon Park
Eunha Lee
Kyung Sook Shin
Jeonghyeon Lee
Jong Chul Ye
141
2
0
10 Aug 2022
Advancing Plain Vision Transformer Towards Remote Sensing Foundation Model
IEEE Transactions on Geoscience and Remote Sensing (IEEE TGRS), 2022
Di Wang
Qiming Zhang
Yufei Xu
Jing Zhang
Bo Du
Dacheng Tao
Guang Dai
261
316
0
08 Aug 2022
Prompt Tuning for Generative Multimodal Pretrained Models
Han Yang
Junyang Lin
An Yang
Peng Wang
Chang Zhou
Hongxia Yang
VLM, LRM, VPVLM
176
37
0
04 Aug 2022
Masked Vision and Language Modeling for Multi-modal Representation Learning
International Conference on Learning Representations (ICLR), 2022
Gukyeong Kwon
Zhaowei Cai
Avinash Ravichandran
Erhan Bas
Rahul Bhotika
Stefano Soatto
199
84
0
03 Aug 2022
Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models
Rui Qian
Yeqing Li
Zheng Xu
Ming-Hsuan Yang
Serge Belongie
Huayu Chen
VLM
160
25
0
15 Jul 2022
Convolutional Bypasses Are Better Vision Transformer Adapters
European Conference on Artificial Intelligence (ECAI), 2022
Shibo Jie
Zhi-Hong Deng
VPVLM
229
156
0
14 Jul 2022
Distance Learner: Incorporating Manifold Prior to Model Training
Aditya Chetan
Nipun Kwatra
65
1
0
14 Jul 2022
Revisiting Classifier: Transferring Vision-Language Models for Video Recognition
AAAI Conference on Artificial Intelligence (AAAI), 2022
Wenhao Wu
Zhun Sun
Wanli Ouyang
VLM
346
124
0
04 Jul 2022
ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning
Neural Information Processing Systems (NeurIPS), 2022
Junting Pan
Ziyi Lin
Xiatian Zhu
Jing Shao
Jiaming Song
320
259
0
27 Jun 2022
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
Jiahui Yu
Yuanzhong Xu
Jing Yu Koh
Thang Luong
Gunjan Baid
...
Zarana Parekh
Xin Li
Han Zhang
Jason Baldridge
Yonghui Wu
EGVM
561
1,349
0
22 Jun 2022
REVECA -- Rich Encoder-decoder framework for Video Event CAptioner
Jaehyuk Heo
YongGi Jeong
Sunwoo Kim
Jaehee Kim
Pilsung Kang
98
0
0
18 Jun 2022
Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks
International Conference on Learning Representations (ICLR), 2022
Jiasen Lu
Christopher Clark
Rowan Zellers
Roozbeh Mottaghi
Aniruddha Kembhavi
ObjD, VLM, MLLM
385
472
0
17 Jun 2022
BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning
AAAI Conference on Artificial Intelligence (AAAI), 2022
Xiao Xu
Chenfei Wu
Shachar Rosenman
Vasudev Lal
Wanxiang Che
Nan Duan
244
90
0
17 Jun 2022
MixGen: A New Multi-Modal Data Augmentation
Xiaoshuai Hao
Yi Zhu
Srikar Appalaraju
Aston Zhang
Wanqian Zhang
Boyang Li
Mu Li
VLM
361
121
0
16 Jun 2022
Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone
Neural Information Processing Systems (NeurIPS), 2022
Zi-Yi Dou
Aishwarya Kamath
Zhe Gan
Pengchuan Zhang
Jianfeng Wang
...
Ce Liu
Yann LeCun
Nanyun Peng
Jianfeng Gao
Lijuan Wang
VLM, ObjD
226
150
0
15 Jun 2022
Multimodal Learning with Transformers: A Survey
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022
Peng Xu
Xiatian Zhu
David Clifton
ViT
475
819
0
13 Jun 2022
Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs
Neural Information Processing Systems (NeurIPS), 2022
Jinguo Zhu
Xizhou Zhu
Wenhai Wang
Xiaohua Wang
Jiaming Song
Xiaogang Wang
Jifeng Dai
MoMe, MoE
261
84
0
09 Jun 2022
Neural Collapse: A Review on Modelling Principles and Generalization
Vignesh Kothapalli
367
103
0
08 Jun 2022
Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts
Neural Information Processing Systems (NeurIPS), 2022
Basil Mustafa
C. Riquelme
J. Puigcerver
Rodolphe Jenatton
N. Houlsby
VLM, MoE
330
270
0
06 Jun 2022
Delving into the Openness of CLIP
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
Shuhuai Ren
Lei Li
Xuancheng Ren
Guangxiang Zhao
Xu Sun
VLM
220
15
0
04 Jun 2022
Visual Clues: Bridging Vision and Language Foundations for Image Paragraph Captioning
Neural Information Processing Systems (NeurIPS), 2022
Yujia Xie
Luowei Zhou
Xiyang Dai
Lu Yuan
Nguyen Bach
Ce Liu
Michael Zeng
VLM, MLLM
156
30
0
03 Jun 2022
VL-BEiT: Generative Vision-Language Pretraining
Hangbo Bao
Wenhui Wang
Li Dong
Furu Wei
VLM
158
48
0
02 Jun 2022
Prefix Conditioning Unifies Language and Label Supervision
Computer Vision and Pattern Recognition (CVPR), 2022
Kuniaki Saito
Kihyuk Sohn
Xinming Zhang
Chun-Liang Li
Chen-Yu Lee
Kate Saenko
Tomas Pfister
VLM, CLIP
159
18
0
02 Jun 2022
Cross-View Language Modeling: Towards Unified Cross-Lingual Cross-Modal Pre-training
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
Yan Zeng
Wangchunshu Zhou
Ao Luo
Ziming Cheng
Xinsong Zhang
VLM
281
37
0
01 Jun 2022
MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining
Pengyuan Lyu
Chengquan Zhang
Shanshan Liu
Meina Qiao
Yangliu Xu
Liang Wu
Kun Yao
Junyu Han
Errui Ding
Jingdong Wang
489
46
0
01 Jun 2022
Multimodal Masked Autoencoders Learn Transferable Representations
Xinyang Geng
Hao Liu
Lisa Lee
Dale Schuurmans
Sergey Levine
Pieter Abbeel
307
132
0
27 May 2022
GIT: A Generative Image-to-text Transformer for Vision and Language
Jianfeng Wang
Zhengyuan Yang
Xiaowei Hu
Linjie Li
Kevin Qinghong Lin
Zhe Gan
Zicheng Liu
Ce Liu
Lijuan Wang
VLM
582
698
0
27 May 2022
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
Neural Information Processing Systems (NeurIPS), 2022
Chitwan Saharia
William Chan
Saurabh Saxena
Lala Li
Jay Whang
...
Raphael Gontijo-Lopes
Tim Salimans
Jonathan Ho
David J Fleet
Mohammad Norouzi
VLM
1.1K
7,395
0
23 May 2022
Deep transfer learning for image classification: a survey
J. Plested
Musa Phiri
Tom Gedeon
OOD
190
46
0
20 May 2022
Training Vision-Language Transformers from Captions
Liangke Gui
Yingshan Chang
Qiuyuan Huang
Subhojit Som
Alexander G. Hauptmann
Jianfeng Gao
Yonatan Bisk
VLM, ViT
366
11
0
19 May 2022
When does dough become a bagel? Analyzing the remaining mistakes on ImageNet
Neural Information Processing Systems (NeurIPS), 2022
Vijay Vasudevan
Benjamin Caine
Raphael Gontijo-Lopes
Sara Fridovich-Keil
Rebecca Roelofs
VLM, UQCV
181
69
0
09 May 2022
Unlocking High-Accuracy Differentially Private Image Classification through Scale
Soham De
Leonard Berrada
Jamie Hayes
Samuel L. Smith
Borja Balle
335
261
0
28 Apr 2022
CLIP-Dissect: Automatic Description of Neuron Representations in Deep Vision Networks
International Conference on Learning Representations (ICLR), 2022
Tuomas P. Oikarinen
Tsui-Wei Weng
VLM
354
122
1
23 Apr 2022
Single-Stream Multi-Level Alignment for Vision-Language Pretraining
European Conference on Computer Vision (ECCV), 2022
Zaid Khan
B. Vijaykumar
Xiang Yu
S. Schulter
Manmohan Chandraker
Y. Fu
CLIP, VLM
272
21
0
27 Mar 2022
Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time
International Conference on Machine Learning (ICML), 2022
Mitchell Wortsman
Gabriel Ilharco
S. Gadre
Rebecca Roelofs
Raphael Gontijo-Lopes
...
Hongseok Namkoong
Ali Farhadi
Y. Carmon
Simon Kornblith
Ludwig Schmidt
MoMe
707
1,267
1
10 Mar 2022
Geodesic Multi-Modal Mixup for Robust Fine-Tuning
Neural Information Processing Systems (NeurIPS), 2022
Changdae Oh
Junhyuk So
Hoyoon Byun
Yongtaek Lim
Minchul Shin
Jong-June Jeon
Kyungwoo Song
424
38
0
08 Mar 2022
Problem-dependent attention and effort in neural networks with applications to image resolution and model selection
Image and Vision Computing (IVC), 2022
Chris Rohlfs
495
5
0
05 Jan 2022
Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets
Marcella Cornia
Lorenzo Baraldi
G. Fiameni
Rita Cucchiara
281
14
0
24 Nov 2021
XnODR and XnIDR: Two Accurate and Fast Fully Connected Layers For Convolutional Neural Networks
Journal of Intelligent and Robotic Systems (JIRS), 2021
Jian Sun
A. P. Fard
Mohammad H. Mahoor
3DPC
229
8
0
21 Nov 2021
The Computational Limits of Deep Learning
Neil C. Thompson
Kristjan Greenewald
Keeheon Lee
Gabriel F. Manso
VLM
281
626
0
10 Jul 2020