On a Near-Optimal & Efficient Algorithm for the Sparse Pooled Data Problem

22 December 2023

Abstract

The pooled data problem asks to identify the unknown labels of a set of items from condensed measurements. More precisely, given $n$ items, assume that each item has a label in $\{0,1,\ldots,d\}$, encoded via the ground truth $\boldsymbol{\sigma}$. We call the pooled data problem sparse if the number of non-zero entries of $\boldsymbol{\sigma}$ scales as $k \sim n^{\theta}$ for $\theta \in (0,1)$. The information that is revealed about $\boldsymbol{\sigma}$ comes from pooled measurements, each indicating how many items of each label are contained in the pool. The most basic question is to design a pooling scheme that uses as few pools as possible while reconstructing $\boldsymbol{\sigma}$ with high probability. Variants of the problem and its combinatorial ramifications have been studied for at least 35 years. However, the study of the modern question of \emph{efficient} inference of the labels has suggested a statistical-to-computational gap of order $\log n$ between the minimum number of pools needed for theoretically possible inference and the number needed for efficient inference. In this article, we resolve the question of whether this $\log n$-gap is artificial or of a fundamental nature by designing an efficient algorithm, called \algoname, based upon a novel pooling scheme that uses a number of pools very close to the information-theoretic threshold.
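To make the measurement model concrete, the following minimal Python sketch simulates a sparse ground truth and the pooled measurements described above. It is only an illustration of the problem setup, not the pooling scheme or the algorithm of the paper: the pooling design (each item joins each pool independently with probability $1/2$), the function names, and the parameter values are all assumptions chosen for the example.

```python
import random

def make_ground_truth(n, k, d, rng):
    """Return a length-n label vector with exactly k non-zero labels in {1, ..., d}."""
    sigma = [0] * n
    for i in rng.sample(range(n), k):
        sigma[i] = rng.randint(1, d)
    return sigma

def random_pools(n, m, p, rng):
    """Assumed design (not the paper's): each item joins each pool independently with probability p."""
    return [[i for i in range(n) if rng.random() < p] for _ in range(m)]

def measure(sigma, pool, d):
    """Return counts, where counts[c] is the number of items in the pool with label c (c = 0, ..., d)."""
    counts = [0] * (d + 1)
    for i in pool:
        counts[sigma[i]] += 1
    return counts

# Illustrative parameters (assumptions): n items, sparsity exponent theta, labels 0..d, m pools.
rng = random.Random(0)
n, theta, d, m = 1000, 0.5, 3, 20
k = round(n ** theta)  # number of non-zero labels, k ~ n^theta

sigma = make_ground_truth(n, k, d, rng)
pools = random_pools(n, m, 0.5, rng)
results = [measure(sigma, pool, d) for pool in pools]
```

Each entry of `results` is one pooled measurement: a vector of $d+1$ counts that sums to the size of the corresponding pool. The inference task is to recover `sigma` from `pools` and `results` alone, using as few pools as possible.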
