In the AGEMAP genomics study, researchers were interested in detecting genes related to age in a variety of tissue types. After not finding many age-related genes in some of the analyzed tissue types, the study was criticized for having low power. It is possible that the low power is due to the presence of important unmeasured variables, and indeed we find that a latent factor model appears to explain substantial variability not captured by measured covariates. We propose including the estimated latent factors in a multiple regression model. The key difficulty in doing so is assigning appropriate degrees of freedom to the estimated factors to obtain unbiased error variance estimators and enable valid hypothesis testing. When the number of responses is large relative to the sample size, treating the estimated factors like observed covariates leads to a downward bias in the variance estimates. Many ad-hoc solutions to this problem have been proposed in the literature without the support of a careful theoretical analysis. Using recent results from random matrix theory, we derive a simple, easy-to-use expression for the degrees of freedom. Our estimate gives a principled alternative to the ad-hoc approaches in common use. Extensive simulation results show excellent agreement between the proposed estimator and its theoretical value. Applying our methodology to the AGEMAP genomics study, we find an order of magnitude increase in the number of significant genes. Although we focus on the AGEMAP study, the methods developed in this paper are widely applicable to other multivariate models, and thus are of independent interest.
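The downward bias described above can be illustrated with a small simulation. The sketch below is not the degrees-of-freedom estimator proposed in the paper; it only compares two error variance estimates under an assumed pure latent factor model with no measured covariates. The function name `variance_estimates`, the dimensions, and the Gaussian data-generating model are all assumptions chosen for illustration. Factors are estimated by a truncated SVD (an assumed, common choice) and then charged one degree of freedom each, as if they were observed covariates.

```python
import numpy as np

def variance_estimates(n=40, p=400, k=2, sigma2=1.0, seed=0):
    """Compare naive and oracle error variance estimates (illustrative only).

    n: sample size, p: number of responses, k: number of latent factors.
    All defaults are hypothetical values, not taken from the AGEMAP study.
    """
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n, k))                     # true latent factors
    L = rng.standard_normal((p, k))                     # factor loadings
    E = np.sqrt(sigma2) * rng.standard_normal((n, p))   # noise
    Y = Z @ L.T + E                                     # n x p response matrix

    # Naive approach: estimate the factors by the top-k left singular
    # vectors of Y, then treat them like k observed covariates.
    U, _, _ = np.linalg.svd(Y, full_matrices=False)
    U_k = U[:, :k]
    resid_naive = Y - U_k @ (U_k.T @ Y)
    naive = np.sum(resid_naive ** 2) / (p * (n - k))

    # Oracle approach: regress on the true factors, which are unknown in
    # practice. With fixed covariates, n - k is the correct denominator.
    Q, _ = np.linalg.qr(Z)
    resid_oracle = Y - Q @ (Q.T @ Y)
    oracle = np.sum(resid_oracle ** 2) / (p * (n - k))

    return naive, oracle

if __name__ == "__main__":
    naive, oracle = variance_estimates()
    print(f"naive (estimated factors): {naive:.4f}")
    print(f"oracle (true factors):     {oracle:.4f}")
```

Because the truncated SVD gives the best rank-k approximation of the response matrix, its residual sum of squares can never exceed that of the oracle fit, so with the same denominator the naive estimate is never larger than the oracle one. The size of the gap depends on the dimensions and the factor strength, which is the quantity a corrected degrees-of-freedom expression has to account for.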