
Stream Sampling for Frequency Cap Statistics

Abstract

Unaggregated data streams are prevalent and come from diverse application domains which include interactions of users with web services and IP traffic. The elements of the stream have {\em keys} (cookies, users, queries) and elements with different keys interleave in the stream. Analytics on such data typically utilizes statistics stated in terms of the frequencies of keys. The two most common statistics are {\em distinct keys}, which is the number of active keys in a specified segment, and {\em sum}, which is the sum of the frequencies of keys in the segment. These are two special cases of {\em frequency cap} statistics, defined as the sum of frequencies {\em capped} by a parameter $T$, which are popular in online advertising platforms. We propose a novel general framework for sampling unaggregated streams which provides the first effective stream sampling solution for general frequency cap statistics. Our $\ell$-capped samples provide estimates with tight statistical guarantees for cap statistics with $T=\Theta(\ell)$ and nonnegative unbiased estimates of {\em any} monotone non-decreasing frequency statistics. Our algorithms and estimators are simple and practical and we demonstrate their effectiveness using extensive simulations. An added benefit of our unified design is facilitating {\em multi-objective samples}, which provide estimates with statistical guarantees for a specified set of different statistics, using a single, smaller sample.
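
To make the definition concrete, the following minimal sketch computes a frequency cap statistic exactly by aggregating the whole stream. It is not the sampling scheme proposed here; it only illustrates the quantity that the samples are meant to estimate. The function name `cap_statistic`, the toy stream, and the assumption that each stream element contributes one unit to its key's frequency are illustrative choices, not taken from the paper.

```python
from collections import Counter
import math


def cap_statistic(stream, T):
    """Exact frequency cap statistic: sum over keys of min(T, frequency).

    Assumption (for illustration): each stream element is just a key and
    adds 1 to that key's frequency.
    """
    freq = Counter(stream)  # aggregate the unaggregated stream by key
    return sum(min(T, w) for w in freq.values())


# Hypothetical toy stream with interleaved keys: a appears 3x, b 2x, c 1x.
stream = ["a", "b", "a", "c", "a", "b"]

distinct = cap_statistic(stream, 1)        # T = 1 counts distinct keys
total = cap_statistic(stream, math.inf)    # no cap gives the sum of frequencies
capped = cap_statistic(stream, 2)          # each key contributes at most 2
```

With the cap set to 1 the statistic reduces to distinct keys, and with an infinite cap it reduces to sum, which matches the two special cases described above. Exact aggregation keeps one counter per active key, which is the cost that a stream sample is intended to avoid.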
