
Learning High-dimensional Gaussians from Censored Data

Abstract

We provide efficient algorithms for the problem of distribution learning from high-dimensional Gaussian data where, in each sample, some of the variable values are missing. We suppose that the variables are missing not at random (MNAR). The missingness model, denoted by $S(y)$, is the function that maps any point $y \in \mathbb{R}^d$ to the subset of its coordinates that are seen. In this work, we assume that it is known. We study the following two settings:

(i) Self-censoring: An observation $x$ is generated by first sampling the true value $y$ from a $d$-dimensional Gaussian $\mathcal{N}(\mu^*, \Sigma^*)$ with unknown $\mu^*$ and $\Sigma^*$. For each coordinate $i$, there exists a set $S_i \subseteq \mathbb{R}$ such that $x_i = y_i$ if and only if $y_i \in S_i$. Otherwise, $x_i$ is missing and takes a generic value (e.g., "?"). We design an algorithm that learns $\mathcal{N}(\mu^*, \Sigma^*)$ up to total variation (TV) distance $\epsilon$, using $\mathrm{poly}(d, 1/\epsilon)$ samples, assuming only that each pair of coordinates is observed with sufficiently high probability.

(ii) Linear thresholding: An observation $x$ is generated by first sampling $y$ from a $d$-dimensional Gaussian $\mathcal{N}(\mu^*, \Sigma)$ with unknown $\mu^*$ and known $\Sigma$, and then applying the missingness model $S$, where $S(y) = \{i \in [d] : v_i^\top y \le b_i\}$ for some $v_1, \ldots, v_d \in \mathbb{R}^d$ and $b_1, \ldots, b_d \in \mathbb{R}$. We design an efficient mean estimation algorithm, assuming that none of the possible missingness patterns is very rare conditioned on the values of the observed coordinates, and that any small subset of coordinates is observed with sufficiently high probability.
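To make the two observation models concrete, below is a minimal Python (NumPy) simulation sketch of how a censored sample $x$ is produced from the latent $y$. The interval-valued censoring sets $S_i$, the specific thresholds $v_i, b_i$, and all parameter values are illustrative assumptions for this sketch, not taken from the paper.

import numpy as np

rng = np.random.default_rng(0)
d = 5
mu_star = rng.normal(size=d)          # ground-truth mean (unknown to the learner)
A = rng.normal(size=(d, d))
Sigma_star = A @ A.T + np.eye(d)      # ground-truth covariance (unknown to the learner)

def sample_self_censored(n, intervals):
    # Self-censoring: coordinate i of y ~ N(mu*, Sigma*) is observed
    # iff y_i lies in S_i (here, hypothetically, an interval [lo_i, hi_i]).
    y = rng.multivariate_normal(mu_star, Sigma_star, size=n)
    x = y.astype(object)
    for i, (lo, hi) in enumerate(intervals):
        hidden = (y[:, i] < lo) | (y[:, i] > hi)
        x[hidden, i] = "?"            # missing coordinates take a generic value
    return x

def sample_linear_threshold(n, V, b):
    # Linear thresholding: S(y) = {i : v_i^T y <= b_i}, so coordinate i is
    # hidden exactly when v_i^T y > b_i. Row i of V plays the role of v_i.
    y = rng.multivariate_normal(mu_star, Sigma_star, size=n)
    x = y.astype(object)
    hidden = (y @ V.T) > b            # hidden[k, i] is True when v_i^T y_k > b_i
    x[hidden] = "?"
    return x

# Example: each S_i is the interval [-1, 2]; thresholds are drawn at random.
print(sample_self_censored(3, [(-1.0, 2.0)] * d))
print(sample_linear_threshold(3, rng.normal(size=(d, d)), rng.normal(size=d)))

Note that in both mechanisms whether a coordinate is seen depends on the latent values themselves, so naive estimation from the observed entries alone is biased; the algorithms in the paper are designed to correct for exactly this selection effect.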

@article{bhattacharyya2025_2504.19446,
  title={Learning High-dimensional Gaussians from Censored Data},
  author={Arnab Bhattacharyya and Constantinos Daskalakis and Themis Gouleakis and Yuhao Wang},
  journal={arXiv preprint arXiv:2504.19446},
  year={2025}
}