Detecting Label Errors using Pre-Trained Language Models
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Main: 8 pages · Bibliography: 5 pages · Appendix: 5 pages · 10 figures · 10 tables
Abstract
We show that large pre-trained language models are extremely capable of identifying label errors in datasets: simply verifying data points in descending order of out-of-distribution loss significantly outperforms more complex mechanisms for detecting label errors on natural language datasets. We contribute a novel method for producing highly realistic, human-originated label noise from crowdsourced data, and demonstrate the effectiveness of this method on TweetNLP, providing an otherwise difficult-to-obtain measure of realistic recall.
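To make the ranking idea concrete, the sketch below shows one plausible reading of "verifying data points in descending order of out-of-distribution loss": score each example by the cross-entropy of its observed label under predictions from a model that never trained on that example (e.g., via k-fold fine-tuning of a pre-trained LM), then inspect the highest-loss examples first. This is a minimal illustration, not the paper's released code; `rank_by_ood_loss`, `probs`, and `labels` are hypothetical names.

```python
import numpy as np

def rank_by_ood_loss(probs: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Rank examples by out-of-distribution cross-entropy loss, descending.

    probs:  (n, k) class probabilities for each example, produced by a
            model that did not see that example during training
            (hypothetically, held-out folds of a fine-tuned LM).
    labels: (n,) observed, possibly noisy, integer labels.
    Returns example indices with the most suspicious labels first.
    """
    eps = 1e-12  # avoid log(0) for confident wrong predictions
    # Cross-entropy of the observed label under the held-out model.
    losses = -np.log(probs[np.arange(len(labels)), labels] + eps)
    return np.argsort(-losses)

# Toy usage: the second example's label disagrees with the model,
# so it is surfaced first for human verification.
probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.05, 0.95]])
labels = np.array([0, 0, 1])
print(rank_by_ood_loss(probs, labels))  # -> [1 0 2]
```

Under this reading, no extra machinery (confidence thresholds, noise-rate estimation, etc.) is needed: the loss ordering alone determines the verification queue.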
