
Nonparametric Distributed Learning Framework: Algorithm and Application to Variable Selection

Abstract

The big data era is here, but where are the tools to analyze the data? Dramatic increases in the size of datasets have made traditional "centralized" statistical inference techniques prohibitive. Surprisingly, very little attention has been given to developing inferential algorithms for data whose volume exceeds the capacity of a single-machine system. Indeed, the topic of big data statistical inference is still in its nascent stage of development. A question of immediate concern is how to design a data-intensive statistical inference architecture without changing the fundamental data modeling principles developed for `small' data over the last century. To address this problem we present MetaLP, a flexible and distributed statistical modeling paradigm suitable for large-scale data analysis, where statistical inference meets big data technology. This generic statistical approach addresses two main challenges of large datasets: (1) massive volume and (2) variety, i.e., the mixed-data problem. We apply this general theory in the context of a nonparametric two-sample inference algorithm for Expedia's personalized hotel recommendation engine, based on 10 million records of search results. Furthermore, we show how this broad statistical learning scheme (MetaLP) can be successfully adapted for `small' data in resolving the challenging problem of Simpson's paradox. The R scripts for MetaLP-based parallel processing of massive data, which integrate with Hadoop's MapReduce framework, are available as supplementary materials.
