CoCa: Contrastive Captioners are Image-Text Foundation Models

4 May 2022

Mojtaba Seyedhosseini

Papers citing "CoCa: Contrastive Captioners are Image-Text Foundation Models"

50 / 1,042 papers shown

Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning

Computer Vision and Pattern Recognition (CVPR), 2023

497

325

27 Feb 2023

Cross-modal Contrastive Learning for Multimodal Fake News Detection

ACM Multimedia (ACM MM), 2023

263

25 Feb 2023

Language-Driven Representation Learning for Robotics

Dorsa Sadigh

277

190

24 Feb 2023

Side Adapter Network for Open-Vocabulary Semantic Segmentation

Computer Vision and Pattern Recognition (CVPR), 2023

311

363

23 Feb 2023

Aligning Text-to-Image Models using Human Feedback

Pieter Abbeel

338

383

23 Feb 2023

Language Model Crossover: Variation through Few-Shot Prompting

ACM Transactions on Evolutionary Learning and Optimization (TELO), 2023

458

124

23 Feb 2023

Test-Time Distribution Normalization for Contrastively Learned Vision-language Models

Neural Information Processing Systems (NeurIPS), 2023

Ser-Nam Lim

244

22 Feb 2023

Deep Active Learning in the Presence of Label Noise: A Survey

Moseli Motsóehli

Kyungim Baek

NoLa VLM

278

22 Feb 2023

149

20 Feb 2023

Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey

Machine Intelligence Research (MIR), 2023

Yaowei Wang

Yonghong Tian

Wen Gao

AI4CE VLM

464

272

20 Feb 2023

Few-shot Multimodal Multitask Multilingual Learning

Vasu Sharma

Vinija Jain

211

19 Feb 2023

Zero-Shot Anomaly Detection via Batch Normalization

Neural Information Processing Systems (NeurIPS), 2023

470

15 Feb 2023

Sparse-SignSGD with Majority Vote for Communication-Efficient Distributed Learning

International Symposium on Information Theory (ISIT), 2023

Chanho Park

Namyoon Lee

FedML

157

15 Feb 2023

Symbolic Discovery of Optimization Algorithms

Neural Information Processing Systems (NeurIPS), 2023

...

769

513

13 Feb 2023

Paparazzi: A Deep Dive into the Capabilities of Language and Vision Models for Grounding Viewpoint Descriptions

Findings (Findings), 2023

229

13 Feb 2023

Less is More: Selective Layer Finetuning with SubTuning

212

13 Feb 2023

Calibrating a Deep Neural Network with Its Predecessors

International Joint Conference on Artificial Intelligence (IJCAI), 2023

236

13 Feb 2023

CoMAE: Single Model Hybrid Pre-training on Small-Scale RGB-D Datasets

AAAI Conference on Artificial Intelligence (AAAI), 2023

Gangshan Wu

145

13 Feb 2023

NYCU-TWO at Memotion 3: Good Foundation, Good Teacher, then you have Good Meme Analysis

129

13 Feb 2023

Scaling Vision Transformers to 22 Billion Parameters

International Conference on Machine Learning (ICML), 2023

...

407

766

10 Feb 2023

Analyzing Multimodal Objectives Through the Lens of Generative Diffusion Guidance

Chaerin Kong

Nojun Kwak

DiffM

172

10 Feb 2023

Re-ViLM: Retrieval-Augmented Visual Language Model for Zero and Few-Shot Image Captioning

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

...

194

09 Feb 2023

SimCon Loss with Multiple Views for Text Supervised Semantic Segmentation

199

07 Feb 2023

Pic2Word: Mapping Pictures to Words for Zero-shot Composed Image Retrieval

Computer Vision and Pattern Recognition (CVPR), 2023

Kuniaki Saito

Kihyuk Sohn

Xiang Zhang

Chun-Liang Li

Chen-Yu Lee

Kate Saenko

Tomas Pfister

308

165

06 Feb 2023

Contrast with Reconstruct: Contrastive 3D Representation Learning Guided by Generative Pretraining

International Conference on Machine Learning (ICML), 2023

Xiangyu Zhang

396

187

05 Feb 2023

IC3: Image Captioning by Committee Consensus

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

David M. Chan

Austin Myers

Sudheendra Vijayanarasimhan

David A. Ross

John F. Canny

296

02 Feb 2023

mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video

International Conference on Machine Learning (ICML), 2023

Jiabo Ye

...

Ji Zhang

Jingren Zhou

254

218

01 Feb 2023

UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers

International Conference on Machine Learning (ICML), 2023

365

31 Jan 2023

The Power of External Memory in Increasing Predictive Model Capacity

Xin Wang

156

31 Jan 2023

Alternating Updates for Efficient Transformers

Neural Information Processing Systems (NeurIPS), 2023

Xin Wang

177

30 Jan 2023

Advancing Radiograph Representation Learning with Masked Record Modeling

International Conference on Learning Representations (ICLR), 2023

284

30 Jan 2023

Massively Scaling Heteroscedastic Classifiers

International Conference on Learning Representations (ICLR), 2023

218

30 Jan 2023

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

International Conference on Machine Learning (ICML), 2023

Silvio Savarese

1.3K

6,618

30 Jan 2023

ACL-Fig: A Dataset for Scientific Figure Classification

28 Jan 2023

Neural Additive Models for Location Scale and Shape: A Framework for Interpretable Neural Regression Beyond the Mean

International Conference on Artificial Intelligence and Statistics (AISTATS), 2023

252

27 Jan 2023

Discovering and Mitigating Visual Biases through Keyword Explanation

Computer Vision and Pattern Recognition (CVPR), 2023

507

26 Jan 2023

Affective Faces for Goal-Driven Dyadic Communication

Scott Geng

Carl Vondrick

127

26 Jan 2023

Masked Autoencoding Does Not Help Natural Language Supervision at Scale

Computer Vision and Pattern Recognition (CVPR), 2023

Floris Weers

Vaishaal Shankar

Angelos Katharopoulos

Yinfei Yang

Tom Gunter

CLIP

347

19 Jan 2023

Towards Models that Can See and Read

IEEE International Conference on Computer Vision (ICCV), 2023

285

18 Jan 2023

Learning Customized Visual Models with Retrieval-Augmented Knowledge

Computer Vision and Pattern Recognition (CVPR), 2023

Jianwei Yang

231

17 Jan 2023

Vision Learners Meet Web Image-Text Pairs

181

17 Jan 2023

RILS: Masked Visual Reconstruction in Language Semantic Space

Computer Vision and Pattern Recognition (CVPR), 2023

Shusheng Yang

Ying Shan

188

17 Jan 2023

UATVR: Uncertainty-Adaptive Text-Video Retrieval

IEEE International Conference on Computer Vision (ICCV), 2023

Jingdong Wang

246

16 Jan 2023

Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models

Computer Vision and Pattern Recognition (CVPR), 2023

450

152

16 Jan 2023

Scene-centric vs. Object-centric Image-Text Cross-modal Retrieval: A Reproducibility Study

European Conference on Information Retrieval (ECIR), 2023

Mariya Hendriksen

Svitlana Vakulenko

E. Kuiper

Maarten de Rijke

300

12 Jan 2023

Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

302

12 Jan 2023

Does progress on ImageNet transfer to real-world datasets?

Neural Information Processing Systems (NeurIPS), 2023

193

11 Jan 2023

Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing

Computer Vision and Pattern Recognition (CVPR), 2023

Shruthi Bannur

Stephanie L. Hyland

Qianchu Liu

Fernando Pérez-García

Maximilian Ilse

...

Maria T. A. Wetscherek

313

203

11 Jan 2023

Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training

Computer Vision and Pattern Recognition (CVPR), 2023

338

101

05 Jan 2023

CiT: Curation in Training for Effective Vision-Language Data

IEEE International Conference on Computer Vision (ICCV), 2023

Hu Xu

Saining Xie

Po-Yao (Bernie) Huang

Licheng Yu

Russ Howes

Gargi Ghosh

Luke Zettlemoyer

Christoph Feichtenhofer

VLM DiffM

127

05 Jan 2023