
Nonparametric Distributed Learning Framework: Algorithm and Application to Variable Selection

Abstract

This article proposes a generic statistical approach, called MetaLP, that addresses two main challenges of large datasets: (1) massive volume and (2) variety, i.e., the mixed-data problem. We apply this general theory in the context of variable selection by developing a nonparametric distributed statistical inference framework that allows us to extend traditional and novel statistical methods to massive data that cannot be processed and analyzed all at once using standard statistical software. Our proposed algorithm leverages distributed and parallel computing architectures, making it scalable for large-scale data analysis. Furthermore, we show how this broad statistical learning scheme (MetaLP) can be successfully adapted to `small' data problems, such as resolving the challenging Simpson's paradox. The R scripts for MetaLP-based parallel processing of massive data, integrated with Hadoop's MapReduce framework, are available as supplementary materials.
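The abstract describes a distributed inference scheme in which partitions of a massive dataset are analyzed in parallel and the per-partition results are then combined. As a rough illustration of that split-apply-combine idea only (this is not the MetaLP algorithm itself; the statistic, the partitioning, and the inverse-variance pooling below are all simplifying assumptions for the sketch), one could picture each worker returning an effect estimate with its variance, pooled by a fixed-effect meta-analysis:

```python
# Hypothetical split-apply-combine sketch (NOT the actual MetaLP algorithm):
# each partition yields an effect estimate and its variance for one variable,
# and the results are pooled by fixed-effect (inverse-variance) meta-analysis.
import random
import statistics

def partition_estimates(x, y, k):
    """Split paired data into k partitions; return (estimate, variance) per partition.
    The 'effect' here is simply the mean of x*y, a stand-in for a real statistic."""
    size = len(x) // k
    results = []
    for i in range(k):
        prods = [a * b for a, b in zip(x[i*size:(i+1)*size], y[i*size:(i+1)*size])]
        est = statistics.fmean(prods)
        var = statistics.variance(prods) / len(prods)  # variance of the mean
        results.append((est, var))
    return results

def inverse_variance_pool(results):
    """Fixed-effect meta-analytic combination of per-partition estimates."""
    weights = [1.0 / v for _, v in results]
    pooled = sum(w * e for (e, _), w in zip(results, weights)) / sum(weights)
    return pooled, 1.0 / sum(weights)

random.seed(0)
x = [random.gauss(0, 1) for _ in range(4000)]
y = [xi + random.gauss(0, 1) for xi in x]  # y depends on x, so E[x*y] = 1

pooled, pooled_var = inverse_variance_pool(partition_estimates(x, y, 8))
print(pooled)  # pooled estimate, expected near the true value 1
```

In a MapReduce setting, `partition_estimates` plays the role of the map phase run on each data block, and `inverse_variance_pool` the reduce phase that aggregates worker output into a single inference.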
