Concentration Based Inference in High Dimensional Generalized Regression Models (I: Statistical Guarantees)

17 August 2018

Abstract

We develop simple and non-asymptotically justified methods for hypothesis testing about the coefficients ($\theta^{*}\in\mathbb{R}^{p}$) in high dimensional generalized regression models where $p$ can exceed the sample size. Given a function $h:\,\mathbb{R}^{p}\mapsto\mathbb{R}^{m}$, we consider $H_{0}:\,h(\theta^{*}) = \mathbf{0}_{m}$ against $H_{1}:\,h(\theta^{*})\neq\mathbf{0}_{m}$, where $m$ can be any integer in $\left[1,\,p\right]$ and $h$ can be nonlinear in $\theta^{*}$. Our test statistic is based on the sample "quasi score" vector evaluated at an estimate $\hat{\theta}_{\alpha}$ that satisfies $h(\hat{\theta}_{\alpha})=\mathbf{0}_{m}$, where $\alpha$ is the prespecified Type I error. By exploiting the concentration phenomenon for Lipschitz functions, the key component reflecting the dimension complexity in our non-asymptotic thresholds uses a Monte-Carlo approximation to mimic the expectation around which the relevant quantity concentrates, and it automatically captures the dependencies between the coordinates. We provide probabilistic guarantees in terms of the Type I and Type II errors for the quasi score test. Confidence regions are also constructed for the population quasi-score vector evaluated at $\theta^{*}$. The first set of our results is specific to standard Gaussian linear regression models; the second set allows for reasonably flexible forms of non-Gaussian responses, heteroscedastic noise, and nonlinearity in the regression coefficients, while only requiring the correct specification of the conditional means $\mathbb{E}\left(Y_i | X_i\right)$. The novelty of our methods is that their validity does not rely on good behavior of $\left\Vert \hat{\theta}_\alpha - \theta^*\right\Vert_2$ (or even $n^{-1/2}\left\Vert X\left(\hat{\theta}_\alpha - \theta^*\right)\right\Vert_2$ in the linear regression case), either nonasymptotically or asymptotically.
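To make the Monte-Carlo idea concrete, the following is a minimal sketch in Python with NumPy. It is not the paper's actual threshold or test statistic. It only illustrates, under assumed simplifications (a fixed design matrix `X`, standard Gaussian noise, and the sup-norm as the Lipschitz function), how simulation can approximate an expectation such as $\mathbb{E}\left\Vert n^{-1}X^{T}g\right\Vert_\infty$ with $g\sim\mathcal{N}(\mathbf{0}_n, I_n)$, and how that approximation reflects dependence between coordinates. The function name `mc_expected_sup_norm` and all parameter values are hypothetical.

```python
import numpy as np


def mc_expected_sup_norm(X, n_draws=2000, seed=0):
    """Monte Carlo estimate of E || X^T g / n ||_inf for g ~ N(0, I_n).

    Hypothetical helper for illustration; X is treated as a fixed design.
    """
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    # Each column of G is one independent draw of the Gaussian vector g.
    G = rng.standard_normal((n, n_draws))
    # Sup-norm of X^T g / n for every draw, then average over draws.
    sup_norms = np.abs(X.T @ G / n).max(axis=0)
    return sup_norms.mean()


rng = np.random.default_rng(1)
n, p = 50, 200  # p exceeds the sample size n
X = rng.standard_normal((n, p))

# Estimate for a design with roughly independent columns.
estimate = mc_expected_sup_norm(X)

# Classical bound on the expected maximum of 2p Gaussian variables;
# it ignores any dependence between the coordinates.
sigma_max = np.linalg.norm(X, axis=0).max() / n
union_bound = sigma_max * np.sqrt(2 * np.log(2 * p))

# Extreme dependence: every column is a copy of the first one.
X_dup = np.tile(X[:, :1], (1, p))
estimate_dup = mc_expected_sup_norm(X_dup)

print(f"Monte Carlo estimate (independent columns): {estimate:.4f}")
print(f"Dependence-free bound:                      {union_bound:.4f}")
print(f"Monte Carlo estimate (duplicated columns):  {estimate_dup:.4f}")
```

In this sketch the simulated expectation shrinks when the columns are strongly dependent, whereas a bound that depends only on $p$ and the largest column norm does not adapt in this way.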
