Daisy Bloom Filters

Abstract

A filter is a widely used data structure for storing an approximation of a given set $S$ of elements from some universe $U$ (a countable set). It represents a superset $S' \supseteq S$ that is "close to $S$" in the sense that for $x \notin S$, the probability that $x \in S'$ is bounded by some $\varepsilon > 0$. The advantage of using a filter, when some false positives are acceptable, is that the space usage becomes smaller than what is required to store $S$ exactly. Though filters are well understood from a worst-case perspective, it is clear that state-of-the-art constructions may not be close to optimal for particular distributions of data and queries. Suppose, for instance, that some elements are in $S$ with probability close to 1. Then it would make sense to always include them in $S'$, saving space by not having to represent these elements in the filter. Questions like this have been raised in the context of Weighted Bloom filters (Bruck, Gao and Jiang, ISIT 2006) and Bloom filter implementations that make use of access to learned components (Vaidya, Knorr, Mitzenmacher, and Kraska, ICLR 2021). In this paper, we present a lower bound for the expected space that such a filter requires. We also show that the lower bound is asymptotically tight by exhibiting a filter construction that executes queries and insertions in worst-case constant time, and has a false positive rate at most $\varepsilon$ with high probability over input sets drawn from a product distribution. We also present a Bloom filter alternative, which we call the \textit{Daisy Bloom filter}, that executes operations faster and uses significantly less space than the standard Bloom filter.
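
To make the filter guarantee concrete, the following is a minimal sketch of a standard Bloom filter, the baseline the abstract compares against. It is not the Daisy Bloom filter construction, whose details are not given here. The class name `BloomFilter`, the helper `false_positive_rate`, the SHA-256 double-hashing scheme, and the textbook sizing formulas are all assumptions chosen for illustration.

```python
import hashlib
import math


class BloomFilter:
    """Standard Bloom filter (illustrative sketch): represents a superset S' of S."""

    def __init__(self, capacity: int, eps: float) -> None:
        # Textbook sizing (an assumption, not taken from the paper):
        # m = -n ln(eps) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions.
        self.m = max(1, math.ceil(-capacity * math.log(eps) / (math.log(2) ** 2)))
        self.k = max(1, round(self.m / capacity * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item: str) -> list[int]:
        # Double hashing: derive k bit positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        # Set all k bits for the item; bits are never cleared.
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item: str) -> bool:
        # Every inserted item passes this test, so there are no false negatives.
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))


def false_positive_rate(n: int = 1000, eps: float = 0.01, trials: int = 20000) -> float:
    """Insert n keys, then measure how often non-members are reported as present."""
    bf = BloomFilter(n, eps)
    for i in range(n):
        bf.add(f"member-{i}")
    # S is contained in S': every inserted key must be reported as present.
    assert all(f"member-{i}" in bf for i in range(n))
    hits = sum(f"outsider-{i}" in bf for i in range(trials))
    return hits / trials
```

Calling `false_positive_rate()` gives an empirical estimate that should land near the target $\varepsilon$. Note that this baseline spends the same number of bits on every element, which is exactly the uniformity that distribution-aware filters aim to avoid.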
