Nonparametric distributed learning architecture for big data: Algorithm and applications

Scott Bruce, Zeda Li, Hsiang Chieh Yang, Subhadeep Mukhopadhyay

Research output: Contribution to journal › Article › peer-review

1 Scopus citation

Abstract

Dramatic increases in the size and complexity of modern datasets have made traditional centralized statistical inference prohibitive. In addition to computational challenges associated with big data learning, the presence of numerous data types (e.g., discrete, continuous, categorical, etc.) makes automation and scalability difficult. A question of immediate concern is how to design a data-intensive statistical inference architecture without changing the basic statistical modeling principles developed for small data over the last century. To address this problem, we present MetaLP, a flexible, distributed statistical modeling framework suitable for large-scale data analysis, where statistical inference meets big data computing. This framework consists of three key components that work together to provide a holistic solution for big data learning: (i) partitioning massive data into smaller datasets for parallel processing and efficient computation, (ii) modern nonparametric learning based on a specially designed, orthonormal data transformation leading to mixed data algorithms, and finally (iii) combining heterogeneous local inferences from partitioned data using meta-analysis techniques to arrive at the global inference for the original big data. We present an application of this general theory in the context of a nonparametric two-sample inference algorithm for Expedia personalized hotel recommendations based on 10 million search result records.
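The three components named in the abstract (partition, local nonparametric learning, meta-analytic combination) can be illustrated with a toy pipeline. The sketch below is not the MetaLP implementation. It rests on several assumptions: a standardized mid-rank score stands in for the orthonormal data transformation, the per-partition variance is approximated by 1/n, and the local results are pooled with a DerSimonian-Laird random-effects estimate so that heterogeneity between partitions is accounted for. All function names are hypothetical.

```python
import numpy as np


def mid_rank_score(values):
    """Standardized mid-distribution (mid-rank) score of a 1-D numeric sample.

    Assumption: a simple rank-based stand-in for a first-order
    nonparametric score. Tied values share one score, so discrete and
    continuous (numerically coded) variables use the same code path.
    """
    values = np.asarray(values, dtype=float).ravel()
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    # F_mid(x) = F(x) - 0.5 * P(X = x), evaluated at each distinct value.
    f_mid = (np.cumsum(counts) - 0.5 * counts) / values.size
    score = f_mid[inverse] - 0.5  # the sample mean of F_mid is exactly 0.5
    spread = score.std()
    return score / spread if spread > 0 else score


def local_estimate(x, y):
    """Component (ii): rank-based association computed on one partition.

    Returns (estimate, variance). The variance 1/n is an assumed
    large-sample approximation, not a result quoted from the paper.
    """
    estimate = float(np.mean(mid_rank_score(x) * mid_rank_score(y)))
    return estimate, 1.0 / len(x)


def combine_random_effects(estimates, variances):
    """Component (iii): DerSimonian-Laird random-effects pooling."""
    est = np.asarray(estimates, dtype=float)
    var = np.asarray(variances, dtype=float)
    w = 1.0 / var
    fixed = np.sum(w * est) / np.sum(w)
    q = np.sum(w * (est - fixed) ** 2)  # Cochran's Q heterogeneity statistic
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    # Between-partition variance, truncated at zero.
    tau2 = max(0.0, (q - (est.size - 1)) / c) if est.size > 1 and c > 0 else 0.0
    w_star = 1.0 / (var + tau2)
    pooled = float(np.sum(w_star * est) / np.sum(w_star))
    std_err = float(np.sqrt(1.0 / np.sum(w_star)))
    return pooled, std_err, float(tau2)


def partition_and_combine(x, y, n_parts, seed=0):
    """Component (i) plus the two steps above: random split, local fits, pooling."""
    x, y = np.asarray(x), np.asarray(y)
    order = np.random.default_rng(seed).permutation(len(x))
    blocks = np.array_split(order, n_parts)
    local_results = [local_estimate(x[idx], y[idx]) for idx in blocks]
    estimates, variances = zip(*local_results)
    return combine_random_effects(estimates, variances)
```

Calling `partition_and_combine(x, y, n_parts=10)` returns the pooled estimate, its standard error, and the estimated between-partition variance. In a real deployment each `local_estimate` call would run on a separate worker; the list comprehension here only marks where that parallelism would go.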

Original language: English (US)
Article number: 8303780
Pages (from-to): 166-179
Number of pages: 14
Journal: IEEE Transactions on Big Data
Volume: 5
Issue number: 2
State: Published - Jun 1 2019
Externally published: Yes

Keywords

  • data-parallelism
  • distributed statistical learning
  • heterogeneity
  • LP transformation
  • meta-analysis
  • nonparametric mixed data modeling

ASJC Scopus subject areas

  • Information Systems
  • Information Systems and Management
