arXiv:2110.01968

Missing $g$-mass: Investigating the Missing Parts of Distributions

5 October 2021
Prafulla Chandra
A. Thangaraj
Abstract

Estimating the underlying distribution from iid samples is a classical and important problem in statistics. When the alphabet size is large compared to the number of samples, a portion of the distribution is highly likely to be unobserved or sparsely observed. The missing mass, defined as the sum of probabilities $\text{Pr}(x)$ over the missing letters $x$, and the Good-Turing estimator for missing mass have been important tools in large-alphabet distribution estimation. In this article, given a positive function $g$ from $[0,1]$ to the reals, the missing $g$-mass, defined as the sum of $g(\text{Pr}(x))$ over the missing letters $x$, is introduced and studied. The missing $g$-mass can be used to investigate the structure of the missing part of the distribution. Specific applications for special cases such as order-$\alpha$ missing mass ($g(p)=p^{\alpha}$) and the missing Shannon entropy ($g(p)=-p\log p$) include estimating distance from uniformity of the missing distribution and its partial estimation. Minimax estimation is studied for order-$\alpha$ missing mass for integer values of $\alpha$ and exact minimax convergence rates are obtained. Concentration is studied for a class of functions $g$ and specific results are derived for order-$\alpha$ missing mass and missing Shannon entropy. Sub-Gaussian tail bounds with near-optimal worst-case variance factors are derived. Two new notions of concentration, named strongly sub-Gamma and filtered sub-Gaussian concentration, are introduced and shown to result in right tail bounds that are better than those obtained from sub-Gaussian concentration.
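To make the definitions in the abstract concrete, the following is a minimal simulation sketch, not the authors' code. It assumes the true distribution is known (which only holds in a simulation) and computes the missing $g$-mass for the two special cases named above, alongside the classical Good-Turing estimate of the ordinary missing mass (the fraction of samples whose letter appears exactly once). All function names, the uniform test distribution, and the sample sizes are illustrative assumptions.

```python
import math
import random
from collections import Counter


def missing_g_mass(p, samples, g):
    """Sum g(p[x]) over letters x of the alphabet that never occur in samples.

    p is a dict mapping each letter to its true probability (assumed known,
    which is only realistic in a simulation).
    """
    seen = set(samples)
    return sum(g(px) for x, px in p.items() if x not in seen)


def order_alpha(alpha):
    """g(p) = p**alpha, giving the order-alpha missing mass."""
    return lambda q: q ** alpha


def entropy_term(q):
    """g(p) = -p log p, giving the missing Shannon entropy (0 log 0 taken as 0)."""
    return -q * math.log(q) if q > 0 else 0.0


def good_turing_missing_mass(samples):
    """Good-Turing estimate of the (order-1) missing mass.

    This is the number of letters seen exactly once divided by the sample size.
    It uses only the samples, not the true distribution.
    """
    counts = Counter(samples)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(samples)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    k, n = 1000, 500        # hypothetical alphabet size and sample count
    p = {x: 1.0 / k for x in range(k)}  # uniform distribution, for illustration
    samples = rng.choices(list(p), weights=list(p.values()), k=n)

    print("true missing mass      :", missing_g_mass(p, samples, order_alpha(1)))
    print("Good-Turing estimate   :", good_turing_missing_mass(samples))
    print("order-2 missing mass   :", missing_g_mass(p, samples, order_alpha(2)))
    print("missing Shannon entropy:", missing_g_mass(p, samples, entropy_term))
```

With the alphabet larger than the sample, a sizeable share of letters goes unobserved, which is the regime the abstract describes. The sketch only evaluates these quantities; it does not implement the minimax estimators or the concentration bounds studied in the paper.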
