
Robust estimation of a regression function in exponential families

Abstract

We observe $n$ pairs $X_{1}=(W_{1},Y_{1}),\ldots,X_{n}=(W_{n},Y_{n})$ of independent random variables and assume, although this might not be true, that for each $i\in\{1,\ldots,n\}$, the conditional distribution of $Y_{i}$ given $W_{i}$ belongs to a given exponential family with real parameter $\theta_{i}^{\star}=\boldsymbol{\theta}^{\star}(W_{i})$, the value of which is a function $\boldsymbol{\theta}^{\star}$ of the covariate $W_{i}$. Given a model $\boldsymbol{\overline\Theta}$ for $\boldsymbol{\theta}^{\star}$, we propose an estimator $\boldsymbol{\widehat \theta}$ with values in $\boldsymbol{\overline\Theta}$, the construction of which is independent of the distribution of the $W_{i}$ and which is robust to contamination, outliers and model misspecification. We establish non-asymptotic exponential inequalities for the upper deviations of a Hellinger-type distance between the true distribution of the data and the estimated one based on $\boldsymbol{\widehat \theta}$. Under a suitable parametrization of the exponential family, we deduce a uniform risk bound for $\boldsymbol{\widehat \theta}$ over the class of Hölderian functions and we prove the optimality of this bound up to a logarithmic factor. Finally, we provide an algorithm for calculating $\boldsymbol{\widehat \theta}$ when $\boldsymbol{\theta}^{\star}$ is assumed to belong to functional classes of low or medium dimensions (in a suitable sense) and, in a simulation study, we compare the performance of $\boldsymbol{\widehat \theta}$ to that of the MLE and median-based estimators. The proof of our main result relies on an upper bound, with explicit numerical constants, on the expectation of the supremum of an empirical process over a VC-subgraph class. This bound may be of independent interest.
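For orientation, the setting can be sketched in formulas. This is a sketch under assumptions, not the paper's exact definitions: the natural parametrization, the symbols $T$, $A$, $\nu$, $q_{\theta}$, $Q_{\theta}$, and the Poisson example are illustrative choices not stated in the abstract. The distance shown is the standard Hellinger distance, and the "Hellinger-type" distance used in the paper may differ from it.

```latex
% Assumed natural form of a one-parameter exponential family,
% with density q_theta with respect to a dominating measure nu:
\[
  q_{\theta}(y) = \exp\bigl(\theta\, T(y) - A(\theta)\bigr),
  \qquad Q_{\theta} = q_{\theta}\cdot\nu .
\]
% Regression model posited (possibly wrongly) for the data:
\[
  Y_{i} \mid W_{i} \sim Q_{\boldsymbol{\theta}^{\star}(W_{i})},
  \qquad i = 1,\ldots,n .
\]
% Standard squared Hellinger distance between P = p.nu and Q = q.nu:
\[
  h^{2}(P,Q) = \frac{1}{2}\int \bigl(\sqrt{p}-\sqrt{q}\bigr)^{2}\, d\nu
             = 1 - \int \sqrt{p\,q}\; d\nu .
\]
% Worked example (Poisson family, means lambda and lambda'):
\[
  h^{2}\bigl(\mathrm{Poisson}(\lambda),\mathrm{Poisson}(\lambda')\bigr)
  = 1 - \exp\Bigl(-\tfrac{1}{2}\bigl(\sqrt{\lambda}-\sqrt{\lambda'}\bigr)^{2}\Bigr).
\]
```

The Poisson identity follows from summing $\sqrt{q_{\lambda}(k)\,q_{\lambda'}(k)}$ over $k$, which gives $\exp\bigl(-(\lambda+\lambda')/2+\sqrt{\lambda\lambda'}\bigr)$. Because $h^{2}$ is bounded by 1, a single extreme observation can shift such a distance only by a bounded amount, which is consistent with the robustness to outliers described in the abstract.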
