We revisit binary decision trees from the perspective of partitions of the data. We introduce the notion of a partitioning function, and we relate it to the growth function and to the VC dimension. We consider three types of features: real-valued, categorical ordinal, and categorical nominal, each with different split rules. For each feature type, we upper bound the partitioning function of the class of decision stumps before extending the bounds to the class of general decision trees (of any fixed structure) using a recursive approach. Using these new results, we find the exact VC dimension of decision stumps on examples of $\ell$ real-valued features, which is given by the largest integer $d$ such that $2\ell \ge \binom{d}{\lfloor\frac{d}{2}\rfloor}$. Furthermore, we show that the VC dimension of a binary tree structure with $L_T$ leaves on examples of $\ell$ real-valued features is in $O(L_T \log(L_T\ell))$. Finally, we develop a pruning algorithm based on these results that outperforms the cost-complexity and reduced-error pruning algorithms on a number of data sets, with the advantage that no cross-validation is required.
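The stated VC dimension of decision stumps can be computed directly from the formula: it is the largest integer $d$ satisfying $2\ell \ge \binom{d}{\lfloor d/2 \rfloor}$. A minimal sketch (the function name `stump_vc_dimension` is ours, not from the paper):

```python
from math import comb

def stump_vc_dimension(ell: int) -> int:
    """Largest integer d such that 2*ell >= C(d, floor(d/2)),
    per the abstract's formula for stumps on ell real-valued features."""
    d = 1  # d = 1 always satisfies the condition since C(1, 0) = 1 <= 2*ell
    while 2 * ell >= comb(d + 1, (d + 1) // 2):
        d += 1
    return d
```

For example, `stump_vc_dimension(1)` returns 2, since $2 \cdot 1 \ge \binom{2}{1} = 2$ but $2 \cdot 1 < \binom{3}{1} = 3$. Note that the central binomial coefficient grows roughly as $2^d/\sqrt{d}$, so the VC dimension grows only logarithmically in $\ell$, consistent with the $O(L_T \log(L_T\ell))$ bound for full trees.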