Near-Optimal Model Discrimination with Non-Disclosure

Let $\theta_0$ and $\theta_1$ be the population risk minimizers associated to some loss $\ell$ and two distributions $\mathbb{P}_0$, $\mathbb{P}_1$ on $\mathcal{Z}$. We pose the following question: Given i.i.d. samples from $\mathbb{P}_0$ and $\mathbb{P}_1$, what sample sizes are sufficient and necessary to distinguish between the two hypotheses $\theta^* = \theta_0$ and $\theta^* = \theta_1$ for a given $\theta^* \in \{\theta_0, \theta_1\}$? Making the first steps towards answering this question in full generality, we first consider the case of a well-specified linear model with squared loss. Here we provide matching upper and lower bounds on the sample complexity, showing it to be $\min\{1/\Delta^2, \sqrt{r}/\Delta\}$ up to a constant factor, where $\Delta$ is a measure of separation between $\mathbb{P}_0$ and $\mathbb{P}_1$, and $r$ is the rank of the design covariance matrix. This bound is dimension-independent, and rank-independent for large enough separation. We then extend this result in two directions: (i) for the general parametric setup in the asymptotic regime; (ii) for generalized linear models in the small-sample regime and under weak moment assumptions. In both cases, we derive sample complexity bounds of a similar form, even under misspecification. Our testing procedures only access $\theta^*$ through a certain functional of empirical risk. In addition, the number of observations that allows us to reach statistical confidence in our tests does not allow us to "resolve" the two models, that is, to recover $\theta_0$ and $\theta_1$ up to $O(\Delta)$ prediction accuracy. These two properties allow us to apply our framework in applied tasks where one would like to identify a prediction model, which can be proprietary, while guaranteeing that the model cannot be actually inferred by the identifying agent.
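
The linear-model setting of the abstract can be made concrete with a small simulation. The sketch below is purely illustrative and is not the paper's actual test: it draws samples from two Gaussian linear models whose population risk minimizers are $\theta_0$ and $\theta_1$, and decides which of them a given $\theta^*$ equals by comparing empirical squared-loss risks on the two samples (a naive plug-in functional of empirical risk). All names, dimensions, and parameter choices are assumptions made for the demo.

```python
# Illustrative sketch only (assumed setup, not the paper's procedure):
# discriminate theta_star in {theta0, theta1} from i.i.d. samples of two
# well-specified Gaussian linear models, using a naive comparison of
# empirical squared-loss risks.
import numpy as np

rng = np.random.default_rng(0)
d, n = 20, 200                                      # dimension, samples per distribution

theta0 = rng.normal(size=d)
theta1 = theta0 + rng.normal(size=d) / np.sqrt(d)   # separated risk minimizers

def sample(theta, n):
    """Draw (X, y) with y = <x, theta> + standard Gaussian noise."""
    X = rng.normal(size=(n, d))
    return X, X @ theta + rng.normal(size=n)

def emp_risk(theta, X, y):
    """Empirical squared-loss risk of theta on the sample (X, y)."""
    return np.mean((y - X @ theta) ** 2)

X0, y0 = sample(theta0, n)   # sample from P0 (risk minimized by theta0)
X1, y1 = sample(theta1, n)   # sample from P1 (risk minimized by theta1)

theta_star = theta0          # the tester sees only theta_star, not its identity

# theta_star fits the distribution whose risk it minimizes better, so the
# sign of the empirical risk difference discriminates the two hypotheses,
# without ever estimating theta0 or theta1 from the data.
stat = emp_risk(theta_star, X0, y0) - emp_risk(theta_star, X1, y1)
print("decide: theta_star = theta0" if stat < 0 else "decide: theta_star = theta1")
```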
View on arXiv