Detecting Label Errors using Pre-Trained Language Models
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Main: 8 pages · Bibliography: 5 pages · Appendix: 5 pages · 10 figures · 10 tables
Abstract
We show that large pre-trained language models are extremely capable of identifying label errors in datasets: simply verifying data points in descending order of out-of-distribution loss significantly outperforms more complex mechanisms for detecting label errors on natural language datasets. We contribute a novel method for producing highly realistic, human-originated label noise from crowdsourced data, and demonstrate the effectiveness of this method on TweetNLP, providing an otherwise difficult-to-obtain measure of realistic recall.
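To make the ranking idea concrete, the sketch below shows one plausible reading of "verifying data points in descending order of out-of-distribution loss": score each example by the cross-entropy of its observed label under predictions from a model that never trained on that example (e.g., via k-fold fine-tuning of a pre-trained LM), then inspect the highest-loss examples first. This is a minimal illustration, not the paper's released code; `rank_by_ood_loss`, `probs`, and `labels` are hypothetical names.

```python
import numpy as np

def rank_by_ood_loss(probs: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Rank examples by out-of-distribution cross-entropy loss, descending.

    probs:  (n, k) class probabilities for each example, produced by a
            model that did not see that example during training
            (hypothetically, held-out folds of a fine-tuned LM).
    labels: (n,) observed, possibly noisy, integer labels.
    Returns example indices with the most suspicious labels first.
    """
    eps = 1e-12  # avoid log(0) for confident wrong predictions
    # Cross-entropy of the observed label under the held-out model.
    losses = -np.log(probs[np.arange(len(labels)), labels] + eps)
    return np.argsort(-losses)

# Toy usage: the second example's label disagrees with the model,
# so it is surfaced first for human verification.
probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.05, 0.95]])
labels = np.array([0, 0, 1])
print(rank_by_ood_loss(probs, labels))  # -> [1 0 2]
```

Under this reading, no extra machinery (confidence thresholds, noise-rate estimation, etc.) is needed: the loss ordering alone determines the verification queue.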
