A Distribution Testing Approach to Clustering Distributions

Gunjan Kumar
Yash Pote
Jonathan Scarlett
Main: 9 pages
Bibliography: 3 pages
2 tables
Appendix: 14 pages
Abstract

We study the following distribution clustering problem: given a hidden partition of $k$ distributions into two groups, such that the distributions within each group are identical and the two distributions associated with the two clusters are $\varepsilon$-far in total variation, the goal is to recover the partition. We establish upper and lower bounds on the sample complexity for two fundamental cases: (1) when the distribution of one of the clusters is known, and (2) when both are unknown. Our upper and lower bounds characterize the sample complexity's dependence on the domain size $n$, the number of distributions $k$, the size $r$ of one of the clusters, and the distance $\varepsilon$. In particular, we achieve tightness with respect to $(n, k, r, \varepsilon)$ (up to an $O(\log k)$ factor) for all regimes.
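To make the problem setup concrete, the following is a minimal sketch of a toy instance, not the paper's algorithm or its hard instances. It assumes a uniform distribution `p` and a perturbed distribution `q` at total variation distance exactly `eps`; the function names (`total_variation`, `make_instance`, `draw_samples`) and the choice of perturbation are illustrative assumptions, not taken from the paper.

```python
import random


def total_variation(p, q):
    """Total variation distance between two distributions on the same finite domain."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def make_instance(n, k, r, eps, seed=0):
    """Toy instance (assumed construction): k distributions over a domain of size n,
    with r of them equal to q and k - r equal to p, where d_TV(p, q) = eps."""
    assert n % 2 == 0 and 0 < r < k and 0 < eps <= 0.5
    rng = random.Random(seed)
    p = [1.0 / n] * n  # uniform distribution
    # Shift mass 2*eps/n onto each element of the first half, away from the second half.
    q = [(1 + 2 * eps) / n] * (n // 2) + [(1 - 2 * eps) / n] * (n // 2)
    labels = [0] * k
    for i in rng.sample(range(k), r):  # hidden cluster of size r
        labels[i] = 1
    return p, q, labels


def draw_samples(p, q, labels, m, seed=1):
    """Draw m i.i.d. samples from each of the k distributions."""
    rng = random.Random(seed)
    n = len(p)
    return [
        rng.choices(range(n), weights=(q if lab else p), k=m)
        for lab in labels
    ]


p, q, labels = make_instance(n=10, k=6, r=2, eps=0.25)
samples = draw_samples(p, q, labels, m=5)
```

A clustering procedure would see only `samples` (and, in case (1), also `p`) and would have to recover `labels`; the sample complexity is the number of draws needed for this to succeed.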
