arXiv:1702.05574
Sample complexity of population recovery

18 February 2017
Yury Polyanskiy
A. Suresh
Yihong Wu
Abstract

The problem of population recovery refers to estimating a distribution based on incomplete or corrupted samples. Consider a random poll of sample size $n$ conducted on a population of individuals, where each pollee is asked to answer $d$ binary questions. We consider one of the two polling impediments: (a) in lossy population recovery, a pollee may skip each question with probability $\epsilon$, (b) in noisy population recovery, a pollee may lie on each question with probability $\epsilon$. Given $n$ lossy or noisy samples, the goal is to estimate the probabilities of all $2^d$ binary vectors simultaneously within accuracy $\delta$ with high probability. This paper settles the sample complexity of population recovery. For lossy model, the optimal sample complexity is $\tilde\Theta(\delta^{-2\max\{\frac{\epsilon}{1-\epsilon},1\}})$, improving the state of the art by Moitra and Saks in several ways: a lower bound is established, the upper bound is improved and the result depends at most on the logarithm of the dimension. Surprisingly, the sample complexity undergoes a phase transition from parametric to nonparametric rate when $\epsilon$ exceeds $1/2$. For noisy population recovery, the sharp sample complexity turns out to be more sensitive to dimension and scales as $\exp(\Theta(d^{1/3} \log^{2/3}(1/\delta)))$ except for the trivial cases of $\epsilon=0,1/2$ or $1$. For both models, our estimators simply compute the empirical mean of a certain function, which is found by pre-solving a linear program (LP). Curiously, the dual LP can be understood as Le Cam's method for lower-bounding the minimax risk, thus establishing the statistical optimality of the proposed estimators. The value of the LP is determined by complex-analytic methods.
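
As a rough illustration of the two observation models and of the lossy rate stated in the abstract, the sketch below simulates a single lossy or noisy response and evaluates the exponent $2\max\{\frac{\epsilon}{1-\epsilon},1\}$. It is not the paper's estimator (which relies on a pre-solved linear program); the function names, the `ERASED` marker, and the list-of-bits representation are assumptions made here for illustration only.

```python
import random

# Hypothetical marker for a skipped question (not notation from the paper).
ERASED = None


def lossy_sample(answers, eps, rng):
    """Lossy model: each answer is skipped independently with probability eps."""
    return [ERASED if rng.random() < eps else bit for bit in answers]


def noisy_sample(answers, eps, rng):
    """Noisy model: each answer is flipped independently with probability eps."""
    return [bit ^ 1 if rng.random() < eps else bit for bit in answers]


def lossy_exponent(eps):
    """Exponent a such that the lossy sample complexity scales as delta**(-a),
    up to logarithmic factors, per the rate quoted in the abstract."""
    if not 0 <= eps < 1:
        raise ValueError("eps must lie in [0, 1)")
    return 2 * max(eps / (1 - eps), 1.0)


if __name__ == "__main__":
    rng = random.Random(0)
    truth = [1, 0, 1, 1, 0]  # one pollee's true answers to d = 5 questions
    print(lossy_sample(truth, 0.3, rng))
    print(noisy_sample(truth, 0.3, rng))
    for eps in (0.25, 0.5, 0.75):
        print(eps, lossy_exponent(eps))
```

The exponent stays at 2 (the parametric rate) for every $\epsilon \le 1/2$ and grows beyond that point; at $\epsilon = 0.75$ it equals 6, which is the phase transition the abstract describes.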
