How to Estimate Change from Samples

Knowledge Discovery and Data Mining (KDD), 2012

22 March 2012

Haim Kaplan

Abstract

Measurement data, snapshots of a system, and traffic or activity logs are typically collected repeatedly. Difference queries, which identify and measure change, are central to anomaly detection, monitoring, and planning. When the data is sampled, as is often necessary to meet resource constraints, queries must be processed over the sampled data. Surprisingly, however, we are not aware of pre-existing satisfactory estimators even for Euclidean distances. We derive estimators for $L_p$ ($p$-norm) distances that are nonnegative and variance optimal in a Pareto sense. Our estimators are suitable for independent or coordinated samples of the data and have provably strong properties. For coordinated sampling we present two estimators that trade off variance according to the similarity of the data. Moreover, one of the estimators has the property that, for all data, its variance is close to the minimum possible for that data. We study the performance of our estimators for Manhattan and Euclidean distances ($p=1,2$) on diverse datasets, demonstrating scalability and accuracy.
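
To make the setting concrete, the following is a minimal sketch of independent versus coordinated sampling of two instances, together with the exact $L_p$ distance that a difference query targets. It does not implement the estimators derived in the paper. The sampling scheme (Poisson sampling with probability proportional to value, with a threshold `tau`), the hash-based seeds, and all function names are assumptions chosen for illustration.

```python
import hashlib


def seed(key, salt=""):
    """Map a key to a reproducible pseudo-random number in [0, 1).

    Assumption: a hash of the key stands in for a random seed, so the
    same (salt, key) pair always yields the same seed.
    """
    digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def pps_sample(instance, tau, salt=""):
    """Keep each key with probability min(1, value / tau).

    A key is kept when its value exceeds seed(key) * tau.
    """
    return {k: v for k, v in instance.items() if v > seed(k, salt) * tau}


def lp_distance(a, b, p):
    """Exact L_p distance between two instances; missing keys count as 0."""
    keys = set(a) | set(b)
    return sum(abs(a.get(k, 0) - b.get(k, 0)) ** p for k in keys) ** (1 / p)


# Two hypothetical snapshots that differ in only two keys.
day1 = {f"flow{i}": 1 + (i % 7) for i in range(200)}
day2 = dict(day1)
day2["flow3"] += 5   # one value grows
day2["flow10"] = 0   # one value drops to zero

tau = 20

# Coordinated: both instances share the same seed per key.
coord1 = pps_sample(day1, tau, salt="shared")
coord2 = pps_sample(day2, tau, salt="shared")

# Independent: each instance uses its own seeds.
indep1 = pps_sample(day1, tau, salt="instance-1")
indep2 = pps_sample(day2, tau, salt="instance-2")

print("exact L1 distance:", lp_distance(day1, day2, 1))
print("exact L2 distance:", lp_distance(day1, day2, 2))
print("keys in both coordinated samples:", len(set(coord1) & set(coord2)))
print("keys in both independent samples:", len(set(indep1) & set(indep2)))
```

With shared seeds, a key whose value is unchanged is either in both samples or in neither, so similar instances yield similar samples. With independent seeds, the two samples tend to overlap far less, which leaves fewer keys for which both values are observed.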
