
Data fission: splitting a single data point

Abstract

Suppose we observe a random vector $X$ from some distribution $P$ in a known family with unknown parameters. We ask the following question: when is it possible to split $X$ into two parts $f(X)$ and $g(X)$ such that neither part is sufficient to reconstruct $X$ by itself, but both together can recover $X$ fully, and the joint distribution of $(f(X), g(X))$ is tractable? As one example, if $X = (X_1, \dots, X_n)$ and $P$ is a product distribution, then for any $m < n$, we can split the sample to define $f(X) = (X_1, \dots, X_m)$ and $g(X) = (X_{m+1}, \dots, X_n)$. Rasines and Young (2022) offers an alternative approach that uses additive Gaussian noise -- this enables post-selection inference in finite samples for Gaussian distributed data and asymptotically when errors are non-Gaussian. In this paper, we offer a more general methodology for achieving such a split in finite samples by borrowing ideas from Bayesian inference to yield a (frequentist) solution that can be viewed as a continuous analog of data splitting. We call our method data fission, as an alternative to data splitting, data carving and p-value masking. We exemplify the method on a few prototypical applications, such as post-selection inference for trend filtering and other regression problems.
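As a concrete illustration of the kind of split the abstract describes, consider the Gaussian case with known variance: given $X \sim N(\mu, \sigma^2)$, draw independent auxiliary noise $Z \sim N(0, \sigma^2)$ and set $f(X) = X + Z$ and $g(X) = X - Z$. The two parts are jointly Gaussian with zero covariance, hence independent, and their average recovers $X$ exactly. The sketch below is illustrative only (the specific names and parameter values are ours, not from the paper), checking these properties empirically:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0          # assumed known noise scale
n = 100_000
x = rng.normal(loc=2.0, scale=sigma, size=n)   # X ~ N(mu, sigma^2)

# Fission: auxiliary noise Z ~ N(0, sigma^2), independent of X
z = rng.normal(0.0, sigma, size=n)
f = x + z            # f(X) ~ N(mu, 2 sigma^2)
g = x - z            # g(X) ~ N(mu, 2 sigma^2), independent of f(X)

# Neither part alone determines X, but together they reconstruct it exactly:
x_rec = (f + g) / 2.0
print(np.allclose(x_rec, x))                   # exact recovery
print(abs(np.corrcoef(f, g)[0, 1]) < 0.02)     # empirically uncorrelated
```

In a post-selection workflow, $f(X)$ would be used to select a model and $g(X)$ to carry out inference, mirroring the two halves of an ordinary data split.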
