Geometric Median Matching for Robust k-Subset Selection from Noisy Data

1 April 2025

Anish Acharya

Sujay Sanghavi

Alexandros G. Dimakis

Inderjit S Dhillon

AAML

Main: 34 Pages

17 Figures

Bibliography: 5 Pages

11 Tables

Appendix: 9 Pages

Abstract

Data pruning, the combinatorial task of selecting a small and representative subset from a large dataset, is crucial for mitigating the enormous computational cost of training data-hungry modern deep learning models at scale. Since large-scale data collections are invariably noisy, developing data pruning strategies that remain robust in the presence of corruption is critical in practice. However, existing data pruning methods often fail under high corruption rates because they rely on empirical mean estimation, which is highly sensitive to outliers.
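The sensitivity of the empirical mean to outliers can be illustrated with a small numerical sketch. It compares the mean against the geometric median named in the title, on synthetic data. The Weiszfeld iteration used here is a standard way to approximate the geometric median and is an assumption of this sketch, not necessarily the procedure the paper uses. The function name `geometric_median`, the 20% corruption rate, and the data itself are likewise hypothetical choices made for illustration.

```python
import numpy as np


def geometric_median(points, n_iter=200, eps=1e-8):
    """Approximate the geometric median with Weiszfeld iterations.

    Illustrative helper (assumption): a standard textbook routine,
    not taken from the paper.
    """
    z = points.mean(axis=0)  # start from the empirical mean
    for _ in range(n_iter):
        dist = np.linalg.norm(points - z, axis=1)
        # Inverse-distance weights; clamp to avoid division by zero.
        w = 1.0 / np.maximum(dist, eps)
        z_new = (w[:, None] * points).sum(axis=0) / w.sum()
        if np.linalg.norm(z_new - z) < eps:
            return z_new
        z = z_new
    return z


# Synthetic data (assumption): 80 clean points around the origin
# and 20 corrupted points placed far away at (100, 100).
rng = np.random.default_rng(0)
clean = rng.normal(0.0, 1.0, size=(80, 2))
outliers = np.full((20, 2), 100.0)
data = np.vstack([clean, outliers])

# Distance of each estimate from the true clean center (the origin).
mean_err = np.linalg.norm(data.mean(axis=0))
gm_err = np.linalg.norm(geometric_median(data))

print("empirical mean error:  ", mean_err)
print("geometric median error:", gm_err)
```

In this toy setting the corrupted points drag the empirical mean far from the clean cluster, while the geometric median stays close to it. This is the kind of failure the abstract attributes to mean-based pruning methods.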
