
GIST: Greedy Independent Set Thresholding for Diverse Data Summarization

Main:9 Pages
4 Figures
Bibliography:4 Pages
1 Table
Appendix:8 Pages
Abstract

We propose a novel subset selection task called min-distance diverse data summarization ($\textsf{MDDS}$), which has a wide variety of applications in machine learning, e.g., data sampling and feature selection. Given a set of points in a metric space, the goal is to maximize an objective that combines the total utility of the points and a diversity term that captures the minimum distance between any pair of selected points, subject to the constraint $|S| \le k$. For example, the points may correspond to training examples in a data sampling problem, e.g., learned embeddings of images extracted from a deep neural network. This work presents the $\texttt{GIST}$ algorithm, which achieves a $\frac{2}{3}$-approximation guarantee for $\textsf{MDDS}$ by approximating a series of maximum independent set problems with a bicriteria greedy algorithm. We also prove a complementary $(\frac{2}{3}+\varepsilon)$-hardness of approximation, for any $\varepsilon > 0$. Finally, we provide an empirical study that demonstrates $\texttt{GIST}$ outperforms existing methods for $\textsf{MDDS}$ on synthetic data, and also for a real-world image classification experiment that studies single-shot subset selection for ImageNet.
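To make the objective and the thresholding idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's algorithm: the abstract does not give the exact form of the objective, so the additive combination with a trade-off weight `lam`, the Euclidean metric, the convention that a set with fewer than two points has diversity 0, and the choice of candidate thresholds (all pairwise distances) are assumptions made here. The function names are hypothetical, and this simplified sweep is not claimed to achieve the $\frac{2}{3}$ guarantee.

```python
import itertools
import math


def min_pairwise_distance(points):
    """Smallest Euclidean distance between any two of the given points."""
    if len(points) < 2:
        return 0.0  # assumption: diversity of a set with < 2 points is 0
    return min(math.dist(p, q) for p, q in itertools.combinations(points, 2))


def objective(points, utilities, selected, lam=1.0):
    """Assumed objective: total utility plus lam times the minimum distance."""
    total_utility = sum(utilities[i] for i in selected)
    diversity = min_pairwise_distance([points[i] for i in selected])
    return total_utility + lam * diversity


def greedy_independent_set(points, utilities, threshold, k):
    """Scan points by decreasing utility and keep a point only if it is at
    least `threshold` away from every point kept so far (at most k points)."""
    chosen = []
    for i in sorted(range(len(points)), key=lambda i: -utilities[i]):
        if len(chosen) == k:
            break
        if all(math.dist(points[i], points[j]) >= threshold for j in chosen):
            chosen.append(i)
    return chosen


def threshold_sweep(points, utilities, k, lam=1.0):
    """Run the greedy step for each candidate threshold and keep the best set.
    Candidate thresholds here are 0 and all pairwise distances (an assumption)."""
    thresholds = {0.0} | {
        math.dist(p, q) for p, q in itertools.combinations(points, 2)
    }
    best, best_value = [], float("-inf")
    for t in sorted(thresholds):
        candidate = greedy_independent_set(points, utilities, t, k)
        value = objective(points, utilities, candidate, lam)
        if value > best_value:
            best, best_value = candidate, value
    return best, best_value


# Toy instance: two near-duplicate high-utility points and two distant ones.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (5.0, 5.0)]
utilities = [1.0, 0.9, 0.5, 0.4]
best, best_value = threshold_sweep(points, utilities, k=2)
```

With a threshold of 0 the greedy step simply takes the top-`k` utilities, which here selects the two near-duplicates. Larger thresholds force the selection apart, and the sweep keeps whichever candidate scores best under the assumed objective.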
