Feature selection is an important technique in machine learning and pattern classification, especially for high-dimensional data. Most existing methods are neither accurate enough nor fast enough for large-scale, ultra-high-dimensional data. To address this open challenge, we present a simple yet effective second-order online feature selection algorithm that is extremely efficient and scales to large data sizes and ultra-high dimensionality. Unlike conventional methods, the proposed algorithm exploits second-order information to select the most confident weights while keeping the truncated weight distribution close to the non-truncated one. We conducted extensive experiments comparing against both online and batch feature selection techniques. The results show that our technique not only outperforms existing online algorithms, but also achieves accuracy highly competitive with state-of-the-art batch feature selection methods at orders of magnitude lower computational cost. Impressively, on a billion-scale synthetic dataset (1 billion dimensions, 1 billion nonzero features, and 1 million samples), our algorithm took only eight minutes on a single ordinary machine.
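To make the idea concrete, the following is a minimal sketch of one second-order online step with confidence-based truncation. It is an illustrative assumption, not the paper's exact algorithm: it uses an AROW-style diagonal second-order update, and the function name `sofs_step`, the regularizer `r`, and the budget `B` are all hypothetical names chosen for this example.

```python
import numpy as np

def sofs_step(w, sigma, x, y, r=1.0, B=2):
    """Hypothetical second-order online update with truncation.

    w     : weight vector
    sigma : per-feature variance (low variance = high confidence)
    x     : feature vector, y : label in {-1, +1}
    r     : regularization parameter, B : number of features to keep

    NOTE: this AROW-style update and the truncation rule are an
    illustrative sketch, not the paper's exact algorithm.
    """
    margin = y * np.dot(w, x)
    if margin < 1.0:  # nonzero hinge loss: perform the update
        v = np.dot(x * sigma, x)                 # confidence-weighted norm
        beta = 1.0 / (v + r)
        alpha = max(0.0, 1.0 - margin) * beta
        w = w + alpha * y * (sigma * x)          # second-order gradient step
        sigma = sigma - beta * (sigma * x) ** 2  # shrink variance on seen features
    # truncation: keep only the B most confident (lowest-variance) nonzero weights
    nz = np.flatnonzero(w)
    if nz.size > B:
        drop = nz[np.argsort(sigma[nz])[B:]]     # least confident among nonzeros
        w = w.copy()
        w[drop] = 0.0
    return w, sigma
```

The truncation step is what keeps the model sparse online: rather than dropping the smallest weights (as first-order truncation would), it drops the weights the learner is least confident about, which is the sense in which the distribution over weights stays close to its non-truncated counterpart.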