CleanPatrick: A Benchmark for Image Data Cleaning

Robust machine learning depends on clean data, yet current image data cleaning benchmarks rely on synthetic noise or narrow human studies, limiting comparison and real-world relevance. We introduce CleanPatrick, the first large-scale benchmark for data cleaning in the image domain, built upon the publicly available Fitzpatrick17k dermatology dataset. We collect 496,377 binary annotations from 933 medical crowd workers, identify off-topic samples (4%), near-duplicates (21%), and label errors (22%), and employ an aggregation model inspired by item-response theory, followed by expert review, to derive high-quality ground truth. CleanPatrick formalizes issue detection as a ranking task and evaluates methods with standard ranking metrics that mirror real audit workflows. Benchmarking classical anomaly detectors, perceptual hashing, SSIM, Confident Learning, NoiseRank, and SelfClean, we find that, on CleanPatrick, self-supervised representations excel at near-duplicate detection, classical methods achieve competitive off-topic detection under constrained review budgets, and label-error detection remains an open challenge for fine-grained medical classification. By releasing both the dataset and the evaluation framework, CleanPatrick enables a systematic comparison of image-cleaning strategies and paves the way for more reliable data-centric artificial intelligence.
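To make the ranking formulation concrete, the sketch below shows how such an evaluation could be computed: a detector assigns each sample an issue score, samples are ranked by score, and precision@k and average precision are measured against ground-truth issue flags. The function names and toy data are illustrative assumptions, not CleanPatrick's actual evaluation code.

```python
# Hypothetical sketch of ranking-based issue-detection evaluation.
# A detector scores each sample; higher score = more likely an issue.

def precision_at_k(ranked_flags, k):
    """Fraction of true issues among the top-k ranked samples."""
    top = ranked_flags[:k]
    return sum(top) / len(top)

def average_precision(ranked_flags):
    """Mean of precision@i over ranks i where a true issue appears."""
    hits, precisions = 0, []
    for i, flag in enumerate(ranked_flags, start=1):
        if flag:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / max(hits, 1)

# Toy example: detector confidences and ground-truth issue flags.
scores = [0.9, 0.8, 0.7, 0.6, 0.5]
labels = [1, 0, 1, 1, 0]

# Rank samples by descending detector score.
order = sorted(range(len(scores)), key=lambda i: -scores[i])
ranked = [labels[i] for i in order]

print(precision_at_k(ranked, 3))   # 2/3: two of the top three are issues
print(average_precision(ranked))
```

Under a constrained review budget, precision@k directly reflects how many flagged samples an auditor would actually need to inspect before finding true issues.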
@article{gröger2025_2505.11034,
  title   = {CleanPatrick: A Benchmark for Image Data Cleaning},
  author  = {Fabian Gröger and Simone Lionetti and Philippe Gottfrois and Alvaro Gonzalez-Jimenez and Ludovic Amruthalingam and Elisabeth Victoria Goessinger and Hanna Lindemann and Marie Bargiela and Marie Hofbauer and Omar Badri and Philipp Tschandl and Arash Koochek and Matthew Groh and Alexander A. Navarini and Marc Pouly},
  journal = {arXiv preprint arXiv:2505.11034},
  year    = {2025}
}