Feature selection is an important technique in machine learning and pattern classification, especially for handling high-dimensional data. Most existing studies have been restricted to batch learning, which is often inefficient and scales poorly to big data in the real world, especially when data arrives sequentially. Recent years have witnessed emerging feature selection techniques based on online learning. Despite significant advantages in efficiency and scalability, existing online feature selection methods are not always sufficiently accurate, and are still not fast enough for massive-scale data with ultra-high dimensionality. To address these limitations, we propose a novel online feature selection method that exploits second-order information with optimized implementations, which not only improves learning efficacy but also significantly enhances computational efficiency. We conduct extensive experiments evaluating both the learning accuracy and the time cost of different algorithms on massive-scale synthetic and real-world datasets, including a dataset with billion-scale features. Our results show that our technique achieves accuracy highly competitive with state-of-the-art batch feature selection methods, while incurring computational cost orders of magnitude lower than both state-of-the-art batch and online feature selection methods. On a billion-scale synthetic dataset (1 billion dimensions, 1 billion nonzero features, and 1 million samples), our algorithm took only eight minutes on a single commodity machine.
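The abstract does not spell out the update rule, but a common way to combine second-order online learning with feature selection is to maintain a diagonal covariance over the weights (AROW-style confidence-weighted updates) and truncate the model to a feature budget after each update. The sketch below is a minimal illustration of that general idea, not the paper's exact algorithm; the function name, the budget parameter, and the regularizer `r` are assumptions for illustration.

```python
import numpy as np

def online_second_order_feature_selection(stream, dim, budget, r=1.0):
    """Illustrative sketch (not the paper's exact method): AROW-style
    second-order online learning with a diagonal covariance, plus
    truncation of the weight vector to the `budget` largest-magnitude
    entries after each update."""
    mu = np.zeros(dim)        # mean weight vector
    sigma = np.ones(dim)      # per-feature variance (diagonal covariance)
    mistakes = 0
    for x, y in stream:       # x: feature vector, y in {-1, +1}
        margin = y * np.dot(mu, x)
        if margin < 1.0:      # hinge-loss violation: perform an update
            mistakes += 1
            beta = 1.0 / (np.dot(x * sigma, x) + r)
            alpha = max(0.0, 1.0 - margin) * beta
            mu += alpha * y * sigma * x        # confidence-weighted mean update
            sigma -= beta * (sigma * x) ** 2   # variance shrinks for active features
            # feature selection step: keep only the `budget` largest weights
            if np.count_nonzero(mu) > budget:
                drop = np.argsort(np.abs(mu))[:-budget]
                mu[drop] = 0.0
    return mu, mistakes
```

The second-order (variance) term makes the step size per feature adaptive: frequently seen, confident features receive smaller updates, which is what typically yields better accuracy than first-order truncation methods at comparable cost.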