Daisy Bloom Filters

Abstract

Weighted Bloom filters (Bruck, Gao and Jiang, ISIT 2006) are Bloom filters that adapt the number of hash functions according to the query element. That is, they use a sequence of hash functions $h_1, h_2, \dots$ and insert $x$ by setting the bits in $k_x$ positions $h_1(x), h_2(x), \dots, h_{k_x}(x)$ to 1, where the parameter $k_x$ depends on $x$. Similarly, a query for $x$ checks whether the bits at positions $h_1(x), h_2(x), \dots, h_{k_x}(x)$ contain a $0$ (in which case we know that $x$ was not inserted) or contain only $1$s (in which case $x$ may have been inserted, but it could also be a false positive). In this paper, we determine a near-optimal choice of the parameters $k_x$ in a model where $n$ elements are inserted independently from a probability distribution $\mathcal{P}$ and query elements are chosen from a probability distribution $\mathcal{Q}$, under a bound on the false positive probability $F$. In contrast, the parameter choice of Bruck et al., as well as follow-up work by Wang et al., does not guarantee a nontrivial bound on the false positive rate. We refer to our parameterization of the weighted Bloom filter as a \textit{Daisy Bloom filter}. For many distributions $\mathcal{P}$ and $\mathcal{Q}$, the Daisy Bloom filter space usage is significantly smaller than that of standard Bloom filters. Our upper bound is complemented by an information-theoretic lower bound, showing that (with mild restrictions on the distributions $\mathcal{P}$ and $\mathcal{Q}$) the space usage of Daisy Bloom filters is the best possible up to a constant factor. Daisy Bloom filters can be seen as a fine-grained variant of a recent data structure of Vaidya, Knorr, Mitzenmacher and Kraska. Like their work, we are motivated by settings in which we have prior knowledge of the workload of the filter, possibly in the form of advice from a machine learning algorithm.
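
To make the insert and query operations described above concrete, the following is a minimal Python sketch of a weighted Bloom filter with a per-element number of hash functions $k_x$. The class name WeightedBloomFilter, the SHA-256-based hashing, and the k_for parameter are illustrative assumptions, not the paper's implementation; in particular, the k_for shown in the usage example is an arbitrary rule, not the near-optimal Daisy parameterization derived in the paper.

    import hashlib

    class WeightedBloomFilter:
        """Sketch of a weighted Bloom filter: element x is hashed into
        k_x positions, where k_x = k_for(x) may differ per element.
        Hypothetical names and hashing scheme, for illustration only."""

        def __init__(self, m, k_for):
            self.m = m              # number of bits in the filter
            self.bits = [0] * m
            self.k_for = k_for      # maps an element to its k_x

        def _positions(self, x):
            # Derive h_1(x), ..., h_{k_x}(x) from seeded hashes of x.
            for i in range(self.k_for(x)):
                digest = hashlib.sha256(f"{i}:{x}".encode()).hexdigest()
                yield int(digest, 16) % self.m

        def insert(self, x):
            for p in self._positions(x):
                self.bits[p] = 1

        def query(self, x):
            # False means x was definitely not inserted; True means x may
            # have been inserted, but could be a false positive.
            return all(self.bits[p] for p in self._positions(x))

    # Usage: spend more hash functions on elements expected to be queried often.
    bf = WeightedBloomFilter(m=1024, k_for=lambda x: 8 if x.startswith("hot") else 3)
    bf.insert("hot:alice")
    assert bf.query("hot:alice")    # always True for inserted elements
    print(bf.query("cold:bob"))     # usually False; True would be a false positive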
