Probably Approximately Correct Labels

Main: 13 pages, 7 figures, 7 tables; Bibliography: 3 pages; Appendix: 2 pages
Abstract
Obtaining high-quality labeled datasets is often costly, requiring either extensive human annotation or expensive experiments. We propose a method that supplements such "expert" labels with AI predictions from pre-trained models to construct labeled datasets more cost-effectively. Our approach results in probably approximately correct labels: with high probability, the overall labeling error is small. This solution enables rigorous yet efficient dataset curation using modern AI models. We demonstrate the benefits of the methodology through text annotation with large language models, image labeling with pre-trained vision models, and protein folding analysis with AlphaFold.
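As one concrete reading of the "probably approximately correct" guarantee (the notation below is our own illustration; the paper may use a more general error measure): writing \(\hat{y}_1,\dots,\hat{y}_n\) for the returned labels, \(y_1,\dots,y_n\) for the true labels, and \(\varepsilon,\delta\in(0,1)\) for user-chosen tolerances, the guarantee takes the form
\[
\mathbb{P}\!\left( \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\{\hat{y}_i \neq y_i\} \le \varepsilon \right) \;\ge\; 1-\delta,
\]
i.e., with probability at least \(1-\delta\), at most an \(\varepsilon\)-fraction of the returned labels are incorrect, while expert annotation is spent only where the AI predictions cannot be certified.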
@article{candès2025_2506.10908,
  title   = {Probably Approximately Correct Labels},
  author  = {Emmanuel J. Candès and Andrew Ilyas and Tijana Zrnic},
  journal = {arXiv preprint arXiv:2506.10908},
  year    = {2025}
}