Coresets for Data Discretization and Sine Wave Fitting

Abstract

In the \emph{monitoring} problem, the input is an unbounded stream $P=\{p_1,p_2,\cdots\}$ of integers in $[N]:=\{1,\cdots,N\}$, that are obtained from a sensor (such as GPS or heart beats of a human). The goal (e.g., for anomaly detection) is to approximate the $n$ points received so far in $P$ by a single frequency $\sin$, e.g. $\min_{c\in C} cost(P,c)+\lambda(c)$, where $cost(P,c)=\sum_{i=1}^n \sin^2(\frac{2\pi}{N} p_i c)$, $C\subseteq [N]$ is a feasible set of solutions, and $\lambda$ is a given regularization function. For any approximation error $\varepsilon>0$, we prove that \emph{every} set $P$ of $n$ integers has a weighted subset $S\subseteq P$ (sometimes called core-set) of cardinality $|S|\in O(\log(N)^{O(1)})$ that approximates $cost(P,c)$ (for every $c\in [N]$) up to a multiplicative factor of $1\pm\varepsilon$. Using known coreset techniques, this implies streaming algorithms using only $O((\log(N)\log(n))^{O(1)})$ memory. Our results hold for a large family of functions. Experimental results and open source code are provided.
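
To make the objective concrete, the following is a minimal Python sketch of the cost function defined above, with a brute-force minimizer over a candidate set $C$ and a helper that measures how well a weighted subset approximates the full cost over every $c\in[N]$. It is not the paper's coreset construction: the function names (`cost`, `best_frequency`, `max_relative_error`) and the default zero regularizer are assumptions made for illustration only.

```python
import math

def cost(points, c, N, weights=None):
    """Weighted sum of sin^2(2*pi*p*c/N) over the points (weights default to 1)."""
    if weights is None:
        weights = [1.0] * len(points)
    return sum(w * math.sin(2 * math.pi * p * c / N) ** 2
               for p, w in zip(points, weights))

def best_frequency(points, N, candidates, reg=lambda c: 0.0):
    """Brute-force argmin of cost(P, c) + reg(c) over the candidate set C.

    `reg` plays the role of the regularization function lambda; the zero
    default is an assumption for this sketch.
    """
    return min(candidates, key=lambda c: cost(points, c, N) + reg(c))

def max_relative_error(points, subset, weights, N):
    """Largest relative gap between the full cost and the weighted-subset cost,
    taken over every c in [N]. Queries with (numerically) zero cost are skipped."""
    worst = 0.0
    for c in range(1, N + 1):
        full = cost(points, c, N)
        approx = cost(subset, c, N, weights)
        if full > 1e-12:
            worst = max(worst, abs(full - approx) / full)
    return worst
```

A weighted subset $S$ satisfies the $1\pm\varepsilon$ guarantee stated above when `max_relative_error(P, S, weights, N)` is at most $\varepsilon$, up to the zero-cost queries the helper skips; this check costs $O(N\cdot n)$ time and is only practical for small $N$.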
