Unsupervised Data Selection for Data-Centric Semi-Supervised Learning

6 October 2021

Abstract

We study unsupervised data selection for semi-supervised learning (SSL), where a large-scale unlabeled dataset is available and a small subset of the data is budgeted for label acquisition. Existing SSL methods focus on learning a model that effectively integrates information from a small set of given labeled data and a large set of unlabeled data, whereas we focus on selecting the right data to annotate for SSL without requiring any label or task information. Intuitively, the instances to be labeled should collectively have maximum diversity and coverage for downstream tasks, and individually have maximum information propagation utility for SSL. We formalize these concepts in a three-step data-centric SSL method that improves FixMatch in stability and accuracy by 8% on CIFAR-10 (0.08% labeled) and 14% on ImageNet-1K (0.2% labeled). It is also a universal framework that works with various SSL methods, delivering consistent performance gains. Our work demonstrates that a small amount of computation spent on carefully selecting data for annotation yields large gains in annotation efficiency and model performance without changing the learning pipeline. Our completely unsupervised data selection can be easily extended to other weakly supervised learning settings.
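
To make the selection intuition concrete, here is a minimal illustrative sketch. It is not the three-step method of the paper, whose details the abstract does not give. It assumes each unlabeled instance already has a feature vector from some unsupervised encoder, approximates "diversity and coverage" by partitioning the features into as many clusters as the labeling budget, and approximates "individual utility" by choosing, in each cluster, the instance whose nearest neighbors are closest (a simple density proxy). The function names `farthest_first_centers`, `kmeans_assign`, `knn_spread`, and `select_for_labeling`, and the parameters `k` and `iters`, are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first_centers(points, k):
    """Deterministic initialization: start at points[0], then repeatedly
    add the point farthest from all centers chosen so far."""
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(far))
    return centers


def kmeans_assign(points, k, iters=10):
    """Plain Lloyd's k-means; returns a cluster index for every point."""
    centers = farthest_first_centers(points, k)

    def nearest_center(p):
        return min(range(k), key=lambda c: sq_dist(p, centers[c]))

    for _ in range(iters):
        assign = [nearest_center(p) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return [nearest_center(p) for p in points]


def knn_spread(i, points, k):
    """Mean squared distance from point i to its k nearest neighbors.
    A smaller value means a denser neighborhood (assumed utility proxy)."""
    dists = sorted(sq_dist(points[i], q) for j, q in enumerate(points) if j != i)
    nearest = dists[:k]
    return sum(nearest) / len(nearest) if nearest else 0.0


def select_for_labeling(points, budget, k=5, iters=10):
    """Return indices of at most `budget` instances to send for annotation:
    one per cluster (coverage), the densest in its cluster (utility)."""
    assign = kmeans_assign(points, budget, iters)
    picked = []
    for c in range(budget):
        members = [i for i, a in enumerate(assign) if a == c]
        if members:
            picked.append(min(members, key=lambda i: knn_spread(i, points, k)))
    return picked
```

Calling `select_for_labeling(features, budget=40)` would return up to 40 indices to annotate. For scale, 40 labels is what a 0.08% budget amounts to if the percentage is taken over the 50,000-image CIFAR-10 training set. The pure-Python loops are quadratic in the number of instances, so a real dataset would need a vectorized or approximate nearest-neighbor implementation.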
