
Distribution-Independent Regression for Generalized Linear Models with Oblivious Corruptions

Abstract

We demonstrate the first algorithms for the problem of regression for generalized linear models (GLMs) in the presence of additive oblivious noise. We assume we have sample access to examples $(x, y)$, where $y$ is a noisy measurement of $g(w^* \cdot x)$. In particular, the noisy labels are of the form $y = g(w^* \cdot x) + \xi + \epsilon$, where $\xi$ is the oblivious noise, drawn independently of $x$ and satisfying $\Pr[\xi = 0] \geq o(1)$, and $\epsilon \sim \mathcal{N}(0, \sigma^2)$. Our goal is to accurately recover a parameter vector $w$ such that the function $g(w \cdot x)$ has arbitrarily small error when compared to the true values $g(w^* \cdot x)$, rather than the noisy measurements $y$. We present an algorithm that tackles this problem in its most general distribution-independent setting, where the solution may not even be identifiable. Our algorithm returns an accurate estimate of the solution if it is identifiable, and otherwise returns a small list of candidates, one of which is close to the true solution. Furthermore, we provide a necessary and sufficient condition for identifiability, which holds in broad settings. Specifically, the problem is identifiable when the quantile at which $\xi + \epsilon = 0$ is known, or when the family of hypotheses does not contain candidates that are nearly equal to a translated $g(w^* \cdot x) + A$ for some real number $A$, while also having large error when compared to $g(w^* \cdot x)$. This is the first algorithmic result for GLM regression with oblivious noise which can handle more than half the samples being arbitrarily corrupted. Prior work focused largely on the setting of linear regression, and gave algorithms under restrictive assumptions.
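To make the noise model concrete, the following is a minimal sketch of a data generator for the setting described above. The link function $g$ (a logistic sigmoid here), the dimension, the parameter vector `w_star`, and the heavy corruption distribution are all illustrative assumptions, not choices made in the paper; the essential features are that $\xi$ is drawn independently of $x$, equals zero with some probability $\alpha$, and is otherwise arbitrary.

```python
import math
import random

random.seed(0)

def g(t):
    # Assumed link function for illustration: the logistic sigmoid.
    return 1.0 / (1.0 + math.exp(-t))

d = 3                      # dimension (illustrative)
w_star = [1.0, -2.0, 0.5]  # hypothetical true parameter vector w*
sigma = 0.1                # standard deviation of the Gaussian noise epsilon
alpha = 0.3                # Pr[xi = 0]: fraction of samples with no oblivious corruption

def sample():
    # Covariates from an arbitrary distribution (standard Gaussian here;
    # the algorithm itself is distribution-independent).
    x = [random.gauss(0.0, 1.0) for _ in range(d)]
    # Oblivious noise: drawn independently of x, zero with probability alpha,
    # otherwise an arbitrary corruption (a large Gaussian as a stand-in).
    xi = 0.0 if random.random() < alpha else 100.0 * random.gauss(0.0, 1.0)
    eps = random.gauss(0.0, sigma)
    y = g(sum(wi * xj for wi, xj in zip(w_star, x))) + xi + eps
    return x, y

samples = [sample() for _ in range(1000)]
```

Note that with $\alpha < 1/2$, more than half of the labels carry an arbitrary corruption $\xi$, which is the regime the abstract highlights as newly tractable.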
