18
5

Optimal bounds for p\ell_p sensitivity sampling via 2\ell_2 augmentation

Abstract

Data subsampling is one of the most natural methods to approximate a massively large data set by a small representative proxy. In particular, sensitivity sampling received a lot of attention, which samples points proportional to an individual importance measure called sensitivity. This framework reduces in very general settings the size of data to roughly the VC dimension dd times the total sensitivity S\mathfrak S while providing strong (1±ε)(1\pm\varepsilon) guarantees on the quality of approximation. The recent work of Woodruff & Yasuda (2023c) improved substantially over the general O~(ε2Sd)\tilde O(\varepsilon^{-2}\mathfrak Sd) bound for the important problem of p\ell_p subspace embeddings to O~(ε2S2/p)\tilde O(\varepsilon^{-2}\mathfrak S^{2/p}) for p[1,2]p\in[1,2]. Their result was subsumed by an earlier O~(ε2Sd1p/2)\tilde O(\varepsilon^{-2}\mathfrak Sd^{1-p/2}) bound which was implicitly given in the work of Chen & Derezinski (2021). We show that their result is tight when sampling according to plain p\ell_p sensitivities. We observe that by augmenting the p\ell_p sensitivities by 2\ell_2 sensitivities, we obtain better bounds improving over the aforementioned results to optimal linear O~(ε2(S+d))=O~(ε2d)\tilde O(\varepsilon^{-2}(\mathfrak S+d)) = \tilde O(\varepsilon^{-2}d) sampling complexity for all p[1,2]p \in [1,2]. In particular, this resolves an open question of Woodruff & Yasuda (2023c) in the affirmative for p[1,2]p \in [1,2] and brings sensitivity subsampling into the regime that was previously only known to be possible using Lewis weights (Cohen & Peng, 2015). As an application of our main result, we also obtain an O~(ε2μd)\tilde O(\varepsilon^{-2}\mu d) sensitivity sampling bound for logistic regression, where μ\mu is a natural complexity measure for this problem. This improves over the previous O~(ε2μ2d)\tilde O(\varepsilon^{-2}\mu^2 d) bound of Mai et al. (2021) which was based on Lewis weights subsampling.

View on arXiv
Comments on this paper