
The xyz algorithm for fast interaction search in high-dimensional data

Abstract

When performing regression on a dataset with $p$ variables, it is often of interest to go beyond using main linear effects and include interactions as products between individual variables. For small-scale problems, these interactions can be computed explicitly but this leads to a computational complexity of at least $\mathcal{O}(p^2)$ if done naively. This cost can be prohibitive if $p$ is very large. We introduce a new randomised algorithm that is able to discover interactions with high probability and under mild conditions has a runtime that is subquadratic in $p$. We show that strong interactions can be discovered in almost linear time, whilst finding weaker interactions requires $\mathcal{O}(p^\alpha)$ operations for $1 < \alpha < 2$ depending on their strength. The underlying idea is to transform interaction search into a closest-pair problem which can be solved efficiently in subquadratic time. The algorithm is called $\mathit{xyz}$ and is implemented in the language R. We demonstrate its efficiency for application to genome-wide association studies, where more than $10^{11}$ interactions can be screened in under $280$ seconds with a single-core $1.2$ GHz CPU.
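
To make the contrast between the naive $\mathcal{O}(p^2)$ scan and a closest-pair formulation concrete, here is a minimal Python sketch. It is not the authors' R implementation, and it is not the xyz algorithm as published. It makes several simplifying assumptions: variables and response take values in $\{-1, +1\}$, the response is noiseless, and equal columns are found by hashing random row subsamples, a simple stand-in for a real closest-pair routine. All function names and parameters (`naive_interaction_search`, `subsample_match_search`, `subsample_size`, `n_rounds`) are hypothetical. The sketch relies on one identity: for $\pm 1$ data, $y = x_j x_k$ holds on a row exactly when $x_j = y\,x_k$, so searching for a strong interaction amounts to searching for a column of $X$ that nearly equals a column of $Z = y \odot X$.

```python
import numpy as np


def naive_interaction_search(X, y):
    """Baseline: score all p*(p-1)/2 pairs by the fraction of rows where y == x_j * x_k."""
    n, p = X.shape
    best_pair, best_score = None, -1.0
    for j in range(p):
        for k in range(j + 1, p):
            score = np.mean(y == X[:, j] * X[:, k])
            if score > best_score:
                best_pair, best_score = (j, k), score
    return best_pair, best_score


def subsample_match_search(X, y, subsample_size, n_rounds, rng):
    """Illustrative closest-pair-style search (an assumption, not the published method).

    Builds Z = y * X column-wise, then repeatedly draws a random subset of rows
    and reports pairs (j, k) whose columns X[:, j] and Z[:, k] agree exactly on
    that subset. Exact agreement is detected by hashing, so each round costs
    O(p * subsample_size) plus the number of matches, rather than O(p^2).
    """
    n, p = X.shape
    Z = y[:, None] * X
    candidates = set()
    for _ in range(n_rounds):
        rows = rng.choice(n, size=subsample_size, replace=False)
        # Bucket the columns of X by their values on the sampled rows.
        buckets = {}
        for j in range(p):
            buckets.setdefault(X[rows, j].tobytes(), []).append(j)
        # Look up each column of Z in those buckets.
        for k in range(p):
            for j in buckets.get(Z[rows, k].tobytes(), []):
                if j != k:
                    candidates.add((min(j, k), max(j, k)))
    return candidates


# Synthetic demo: a single noiseless interaction between columns 3 and 17.
rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.choice([-1, 1], size=(n, p))
y = X[:, 3] * X[:, 17]

best_pair, best_score = naive_interaction_search(X, y)
candidates = subsample_match_search(X, y, subsample_size=20, n_rounds=5, rng=rng)
```

In this noiseless setting the planted pair matches on every subsample, so it always appears among the candidates. With a noisy response the match probability per round drops with the interaction strength, which is consistent with the abstract's statement that weaker interactions require more work; the candidates would then be re-scored exactly on the full data.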
