We introduce a novel subset selection problem called min-distance diversification with monotone submodular utility (MDMS), which has a wide variety of applications in machine learning, e.g., data sampling and feature selection. Given a set of points in a metric space, the goal of MDMS is to maximize an objective function that combines a monotone submodular utility term with a min-distance diversity term, i.e., the minimum distance between any pair of selected points, subject to a cardinality constraint. We propose the GIST algorithm, which achieves a constant-factor approximation guarantee for MDMS by approximating a series of maximum independent set problems with a bicriteria greedy algorithm. We also prove that MDMS is NP-hard to approximate beyond a constant factor. Finally, we demonstrate that GIST outperforms existing benchmarks for MDMS on a real-world image classification task that studies single-shot subset selection for ImageNet.
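The thresholding idea in the abstract above can be illustrated with a small sketch. This is not the authors' implementation: the toy coverage utility, the choice of every pairwise distance as a candidate threshold, the convention that a set with fewer than two points has diversity equal to the dataset diameter, the weight `lam`, and all function names are assumptions made for illustration. For each threshold, a greedy routine builds a set whose points are pairwise at least that far apart, and the best set under the combined objective is kept.

```python
import itertools
import math


def coverage_utility(covers, subset):
    """Toy monotone submodular utility: number of distinct items covered."""
    covered = set()
    for i in subset:
        covered |= covers[i]
    return len(covered)


def diversity(points, subset, diameter):
    """Minimum pairwise distance in the subset.

    Assumption: a subset with fewer than two points gets the dataset diameter.
    """
    if len(subset) < 2:
        return diameter
    return min(
        math.dist(points[i], points[j])
        for i, j in itertools.combinations(subset, 2)
    )


def greedy_independent_set(points, covers, k, threshold):
    """Greedily pick up to k points that are pairwise >= threshold apart,
    each time taking the feasible point with the largest marginal utility."""
    selected = []
    while len(selected) < k:
        base = coverage_utility(covers, selected)
        best, best_gain = None, -1
        for i in range(len(points)):
            if i in selected:
                continue
            # Skip points that are too close to something already selected.
            if any(math.dist(points[i], points[j]) < threshold for j in selected):
                continue
            gain = coverage_utility(covers, selected + [i]) - base
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:  # no feasible point left at this threshold
            break
        selected.append(best)
    return selected


def threshold_sweep(points, covers, k, lam):
    """Try each candidate threshold and keep the best set found.

    Assumes at least two points; thresholds are 0 plus all pairwise distances.
    """
    pairs = itertools.combinations(range(len(points)), 2)
    distances = sorted({math.dist(points[i], points[j]) for i, j in pairs})
    diameter = distances[-1]

    def objective(subset):
        return coverage_utility(covers, subset) + lam * diversity(points, subset, diameter)

    candidates = [
        greedy_independent_set(points, covers, k, t) for t in [0.0] + distances
    ]
    return max(candidates, key=objective)


# Hypothetical toy data: four 2-D points and the items each one covers.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (5.0, 5.0)]
covers = [{1, 2}, {1, 2, 3}, {4}, {5}]
chosen = threshold_sweep(points, covers, k=2, lam=1.0)
print(sorted(chosen))  # prints [1, 3]
```

In this toy instance, plain greedy (threshold 0) returns points 1 and 2, while a larger threshold forces the second pick to be point 3, which has the same utility but a larger minimum distance. Sweeping over all pairwise distances is quadratic in the number of points; a coarser grid of thresholds would trade some precision for speed.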
@article{fahrbach2025_2405.18754,
  title={GIST: Greedy Independent Set Thresholding for Diverse Data Summarization},
  author={Matthew Fahrbach and Srikumar Ramalingam and Morteza Zadimoghaddam and Sara Ahmadian and Gui Citovsky and Giulia DeSalvo},
  journal={arXiv preprint arXiv:2405.18754},
  year={2025}
}