Understanding the Gain from Data Filtering in Multimodal Contrastive Learning

Divyansh Pareek
Sewoong Oh
Simon S. Du
Main: 10 pages · 9 figures · 1 table · Bibliography: 3 pages · Appendix: 27 pages
Abstract

The success of modern multimodal representation learning relies on internet-scale datasets. Due to the low quality of a large fraction of raw web data, data curation has become a critical step in the training pipeline. Filtering using a trained model (i.e., teacher-based filtering) has emerged as a successful solution, leveraging a pre-trained model to compute quality scores. To explain the empirical success of teacher-based filtering, we characterize the performance of filtered contrastive learning under the standard bimodal data generation model. Denoting $\eta \in (0,1]$ as the fraction of data with correctly matched modalities among $n$ paired samples, we utilize a linear contrastive learning setup to show a provable benefit of data filtering: $(i)$ the error without filtering is upper and lower bounded by $\frac{1}{\eta \sqrt{n}}$, and $(ii)$ the error with teacher-based filtering is upper bounded by $\frac{1}{\sqrt{\eta n}}$ in the large $\eta$ regime, and by $\frac{1}{\sqrt{n}}$ in the small $\eta$ regime.
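As a minimal illustration of the teacher-based filtering step the abstract describes (a pre-trained teacher scores each paired sample, and only the highest-quality pairs are kept for training), the sketch below ranks pairs by the teacher's cosine similarity between the two modality embeddings and retains a top fraction. All function names and the scoring choice are hypothetical, not from the paper; the paper's analysis is for a linear contrastive setup, while this is just a generic scoring sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def teacher_filter(img_emb, txt_emb, keep_frac=0.5):
    """Keep the keep_frac fraction of pairs with the highest teacher score.

    img_emb, txt_emb: parallel lists of teacher embeddings, one per modality.
    Returns the (sorted) indices of retained pairs. Correctly matched pairs
    tend to score near 1, mismatched pairs near 0, so filtering raises the
    effective fraction of clean data seen by the contrastive learner.
    """
    scores = [cosine(u, v) for u, v in zip(img_emb, txt_emb)]
    k = max(1, int(keep_frac * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])
```

For example, with two correctly matched pairs (identical embeddings, score 1) and two mismatched pairs (orthogonal embeddings, score 0), `teacher_filter` with `keep_frac=0.5` retains exactly the matched pairs.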
