How to Estimate Change from Samples

Knowledge Discovery and Data Mining (KDD), 2012

22 March 2012

Haim Kaplan

Abstract

Measurements, snapshots of a system, traffic matrices, and activity logs are typically collected repeatedly. Difference queries are then used to detect and localize changes for anomaly detection, monitoring, and planning. When the data is sampled, as is often done to meet resource constraints, queries are processed over the sampled data. We are not aware, however, of previously known estimators for $L_p$ ($p$-norm) distances that are accurate when only a small fraction of the data is sampled. We derive estimators for $L_p$ distances that are nonnegative and variance optimal in a Pareto sense, building on our recent work on estimating general functions. Our estimators are applicable whether the samples are independent or coordinated. For coordinated samples we present two estimators that trade off variance according to the similarity of the data. Moreover, one of the estimators has the property that, for all data, its variance is close to the minimum possible for that data. We study the performance of our Manhattan and Euclidean distance ($p=1,2$) estimators on diverse datasets, demonstrating scalability and accuracy: we obtain accurate estimates even when only a small fraction of the data is sampled. We also demonstrate the benefit of tailoring the estimator to characteristics of the dataset.

