Nonparametric Distributed Learning Framework: Algorithm and Application to Variable Selection

S. Bruce, Zeda Li, Hsiang-Chieh Yang, S. Mukhopadhyay
15 August 2015
arXiv:1508.03747
Abstract

The big data era is here, but where are the tools to analyze these data? Dramatic increases in the size of datasets have made traditional "centralized" statistical inference techniques prohibitive. Surprisingly, very little attention has been given to developing inferential algorithms for data whose volume exceeds the capacity of a single machine. Indeed, the topic of big data statistical inference is still in a nascent stage of development. A question of immediate concern is: how can we design a data-intensive statistical inference architecture without abandoning the fundamental data modeling principles developed for `small' data over the last century? To address this problem we present MetaLP, a flexible, distributed statistical modeling paradigm suitable for large-scale data analysis, where statistical inference meets big data technology. This generic statistical approach addresses two main challenges of large datasets: (1) massive volume, and (2) variety, i.e., the mixed data problem. We apply this general theory in the context of a nonparametric two-sample inference algorithm for the Expedia personalized hotel recommendation engine, based on 10 million records of search results. Furthermore, we show how this broad statistical learning scheme (MetaLP) can be successfully adapted to `small' data in resolving the challenging problem of Simpson's paradox. The R scripts for MetaLP-based parallel processing of massive data, integrated with the Hadoop MapReduce framework, are available as supplementary materials.
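
The divide-and-combine recipe the abstract describes (partition the data, run the inference locally on each partition, then pool the local results through meta-analysis) can be sketched in a few lines of R, the language of the paper's supplementary scripts. The sketch below is a minimal illustration under stated assumptions, not the authors' MetaLP implementation: it substitutes a plain mean-difference statistic for the paper's nonparametric LP statistics, simulates its own data, and fixes an arbitrary partition count k; all names are hypothetical.

    # Illustrative divide-and-combine sketch (not the authors' MetaLP code):
    # partition -> local two-sample estimate -> meta-analytic pooling.
    set.seed(1)
    n <- 1e5
    x <- rnorm(n)               # sample 1
    y <- rnorm(n, mean = 0.05)  # sample 2, with a small location shift
    k <- 10                     # number of partitions ("machines"); an assumption
    part <- rep_len(seq_len(k), n)

    est <- numeric(k)  # local effect estimates
    se  <- numeric(k)  # local standard errors
    for (j in seq_len(k)) {
      xj <- x[part == j]
      yj <- y[part == j]
      est[j] <- mean(yj) - mean(xj)  # mean difference stands in for an LP statistic
      se[j]  <- sqrt(var(xj) / length(xj) + var(yj) / length(yj))
    }

    # Fixed-effect meta-analysis: pool the local estimates with
    # inverse-variance weights, as if each partition were an independent study.
    w         <- 1 / se^2
    pooled    <- sum(w * est) / sum(w)
    pooled_se <- sqrt(1 / sum(w))
    z <- pooled / pooled_se
    p <- 2 * pnorm(-abs(z))
    cat("pooled difference:", pooled, " z:", z, " p-value:", p, "\n")

In a genuine Hadoop/MapReduce deployment of this scheme, the per-partition loop would run as mappers and the inverse-variance pooling as the reducer; the statistical logic stays the same.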
