
Coresets for Decision Trees of Signals

Abstract

A $k$-decision tree $t$ (or $k$-tree) is a recursive partition of a matrix (2D-signal) into $k\geq 1$ block matrices (axis-parallel rectangles, leaves), where each rectangle is assigned a real label. Its regression or classification loss with respect to a given matrix $D$ of $N$ entries (labels) is the sum of squared differences between every label in $D$ and the label assigned to it by $t$. Given an error parameter $\varepsilon\in(0,1)$, a $(k,\varepsilon)$-coreset $C$ of $D$ is a small summarization that provably approximates this loss for \emph{every} such tree, up to a multiplicative factor of $1\pm\varepsilon$. In particular, the optimal $k$-tree of $C$ is a $(1+\varepsilon)$-approximation to the optimal $k$-tree of $D$. We provide the first algorithm that outputs such a $(k,\varepsilon)$-coreset for \emph{every} such matrix $D$. The size $|C|$ of the coreset is polynomial in $k\log(N)/\varepsilon$, and its construction takes $O(Nk)$ time. This is achieved by forging a link between decision trees from machine learning and partition trees from computational geometry. Experimental results on \texttt{sklearn} and \texttt{lightGBM} show that applying our coresets to real-world data-sets boosts the computation time of random forests and their parameter tuning by up to x10, while keeping similar accuracy. Full open source code is provided.
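To make the loss definition concrete, below is a minimal Python sketch (not the paper's implementation; the representation of a $k$-tree as a list of labeled rectangles is an illustrative assumption) that computes the sum of squared differences between the entries of a matrix $D$ and the labels of the leaves that contain them, assuming the rectangles partition $D$.

```python
import numpy as np

def tree_loss(D, leaves):
    """Regression loss of a k-tree over the matrix D.

    `leaves` is a list of (r0, r1, c0, c1, label) tuples: each leaf is the
    axis-parallel rectangle D[r0:r1, c0:c1] assigned the real value `label`.
    Assumes the rectangles form a partition of D.
    """
    loss = 0.0
    for (r0, r1, c0, c1, label) in leaves:
        block = D[r0:r1, c0:c1]
        loss += np.sum((block - label) ** 2)
    return loss

# Example: a 2-tree that splits a 4x4 matrix into top and bottom halves,
# labeling each half by its mean (the optimal label for squared loss).
D = np.arange(16, dtype=float).reshape(4, 4)
leaves = [(0, 2, 0, 4, D[:2].mean()),
          (2, 4, 0, 4, D[2:].mean())]
print(tree_loss(D, leaves))
```

A $(k,\varepsilon)$-coreset $C$ would guarantee that this loss, evaluated on $C$ (with appropriate weights), is within a factor of $1\pm\varepsilon$ of `tree_loss(D, leaves)` for every such list of $k$ labeled rectangles.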
