
Coresets for Decision Trees of Signals

Abstract

A $k$-decision tree $t$ (or $k$-tree) is a recursive partition of a matrix (2D-signal) into $k\geq 1$ block matrices (axis-parallel rectangles, leaves), where each rectangle is assigned a real label. Its regression or classification loss with respect to a given matrix $D$ of $N$ entries (labels) is the sum of squared differences between every label in $D$ and the label assigned to it by $t$. Given an error parameter $\varepsilon\in(0,1)$, a $(k,\varepsilon)$-coreset $C$ of $D$ is a small summarization that provably approximates this loss for \emph{every} such tree, up to a multiplicative factor of $1\pm\varepsilon$. In particular, the optimal $k$-tree of $C$ is a $(1+\varepsilon)$-approximation to the optimal $k$-tree of $D$. We provide the first algorithm that outputs such a $(k,\varepsilon)$-coreset for \emph{every} such matrix $D$. The size $|C|$ of the coreset is polynomial in $k\log(N)/\varepsilon$, and its construction takes $O(Nk)$ time. This is achieved by forging a link between decision trees from machine learning and partition trees from computational geometry. Experimental results on \texttt{sklearn} and \texttt{lightGBM} show that applying our coresets to real-world data-sets boosts the computation time of random forests and their parameter tuning by up to x10, while keeping similar accuracy. Full open source code is provided.
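To make the loss definition concrete, below is a minimal Python sketch (not the paper's implementation; the representation of a $k$-tree as a list of labeled rectangles is an illustrative assumption) that computes the sum of squared differences between the entries of a matrix $D$ and the labels of the leaves that contain them, assuming the rectangles partition $D$.

```python
import numpy as np

def tree_loss(D, leaves):
    """Regression loss of a k-tree over the matrix D.

    `leaves` is a list of (r0, r1, c0, c1, label) tuples: each leaf is the
    axis-parallel rectangle D[r0:r1, c0:c1] assigned the real value `label`.
    Assumes the rectangles form a partition of D.
    """
    loss = 0.0
    for (r0, r1, c0, c1, label) in leaves:
        block = D[r0:r1, c0:c1]
        loss += np.sum((block - label) ** 2)
    return loss

# Example: a 2-tree that splits a 4x4 matrix into top and bottom halves,
# labeling each half by its mean (the optimal label for squared loss).
D = np.arange(16, dtype=float).reshape(4, 4)
leaves = [(0, 2, 0, 4, D[:2].mean()),
          (2, 4, 0, 4, D[2:].mean())]
print(tree_loss(D, leaves))
```

A $(k,\varepsilon)$-coreset $C$ would guarantee that this loss, evaluated on $C$ (with appropriate weights), is within a factor of $1\pm\varepsilon$ of `tree_loss(D, leaves)` for every such list of $k$ labeled rectangles.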
