A $k$-decision tree $t$ (or $k$-tree) is a recursive partition of a matrix (2D-signal) into $k\geq 1$ block matrices (axis-parallel rectangles, its leaves), where each rectangle is assigned a real label. Its regression or classification loss with respect to a given matrix $D$ of $N$ entries (labels) is the sum of squared differences between every label in $D$ and the label that $t$ assigns to it. Given an error parameter $\varepsilon\in(0,1)$, a $(k,\varepsilon)$-coreset $C$ of $D$ is a small summarization that provably approximates this loss for \emph{every} such tree, up to a multiplicative factor of $1\pm\varepsilon$. In particular, the optimal $k$-tree of $C$ is a $(1+\varepsilon)$-approximation to the optimal $k$-tree of $D$. We provide the first algorithm that outputs such a $(k,\varepsilon)$-coreset for \emph{every} such matrix $D$. The size $|C|$ of the coreset is polynomial in $k\log(N)/\varepsilon$, and its construction takes $O(Nk)$ time. This is achieved by forging a link between decision trees from machine learning and partition trees from computational geometry. Experimental results on \texttt{sklearn} and \texttt{lightGBM} show that applying our coresets to real-world datasets speeds up the training of random forests and their parameter tuning by up to $\times 10$, while maintaining similar accuracy. Full open source code is provided.
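In symbols (a sketch under assumed notation; the paper's body may state these definitions slightly differently): for a matrix $D\in\mathbb{R}^{n\times m}$ with $N=nm$ entries and a $k$-tree $t$, the loss is
\[
\mathrm{loss}(D,t) \;=\; \sum_{i=1}^{n}\sum_{j=1}^{m}\bigl(D_{i,j}-t(i,j)\bigr)^2,
\]
where $t(i,j)$ is the label of the leaf rectangle that contains entry $(i,j)$. A $(k,\varepsilon)$-coreset $C$ (a weighted subset of entries, whose loss is the correspondingly weighted sum) then satisfies, for every $k$-tree $t$,
\[
(1-\varepsilon)\,\mathrm{loss}(D,t) \;\le\; \mathrm{loss}(C,t) \;\le\; (1+\varepsilon)\,\mathrm{loss}(D,t).
\]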
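To illustrate how such a coreset plugs into standard tooling, the following Python sketch trains an \texttt{sklearn} random forest on a weighted coreset via the library's \texttt{sample\_weight} argument. The function \texttt{build\_coreset} is a hypothetical stand-in for our construction, and the uniform sampling inside it is for illustration only; it does not provide the $(1\pm\varepsilon)$ guarantee of the actual algorithm.
\begin{verbatim}
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def build_coreset(D, k, eps, seed=0):
    """Hypothetical stand-in for the (k, eps)-coreset construction.

    Returns entry coordinates, their labels, and importance weights.
    Uniform sampling is used here for illustration only; the actual
    algorithm chooses entries so that the weighted loss of *every*
    k-tree is preserved up to a factor of 1 +/- eps.
    """
    n, m = D.shape
    N = n * m
    size = int(np.ceil(k * np.log(N) / eps))   # polynomial in k log(N)/eps
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, N, size=size)        # sampled entry indices
    coords = np.column_stack(np.unravel_index(idx, (n, m)))
    labels = D.ravel()[idx]
    weights = np.full(size, N / size)          # inverse sampling probability
    return coords, labels, weights

D = np.random.rand(256, 256)                   # a 2D signal, N = 256 * 256
X, y, w = build_coreset(D, k=16, eps=0.1)
model = RandomForestRegressor(n_estimators=50).fit(X, y, sample_weight=w)
\end{verbatim}
Training and parameter tuning then operate on $|C|\ll N$ weighted points rather than on all of $D$, which is the source of the reported speedups.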