适格决策树：基于任意分割规则求解最优决策树问题的公理化框架 (Proper decision trees: An axiomatic framework for solving optimal decision tree problems with arbitrary splitting rules)

We present an axiomatic framework for analyzing the algorithmic properties of decision trees. This framework supports the classification of decision tree problems through structural and ancestral constraints within a rigorous mathematical foundation. The central focus of this paper is a special class of decision tree problems-which we term proper decision trees-due to their versatility and effectiveness. In terms of versatility, this class subsumes several well-known data structures, including binary space partitioning trees, K-D trees, and machine learning decision tree models. Regarding effectiveness, we prove that only proper decision trees can be uniquely characterized as K-permutations, whereas typical non-proper decision trees correspond to binary-labeled decision trees with substantially greater complexity. Using this formal characterization, we develop a generic algorithmic approach for solving optimal decision tree problems over arbitrary splitting rules and objective functions for proper decision trees. We constructively derive a generic dynamic programming recursion for solving these problems exactly. However, we show that memoization is generally impractical in terms of space complexity, as both datasets and subtrees must be stored. This result contradicts claims in the literature that suggest a trade-off between memoizing datasets and subtrees. Our framework further accommodates constraints such as tree depth and leaf size, and can be accelerated using techniques such as thinning. Finally, we extend our analysis to several non-proper decision trees, including the commonly studied decision tree over binary feature data, the binary search tree, and the tree structure arising in the matrix chain multiplication problem. We demonstrate how these problems can be solved by appropriately modifying or discarding certain axioms.

翻译：本文提出了一种用于分析决策树算法性质的公理化框架。该框架通过严谨的数学基础，支持基于结构与谱系约束对决策树问题进行分类。本文的核心聚焦于一类特殊的决策树问题——我们称之为适格决策树——因其通用性与高效性而备受关注。在通用性方面，该类问题涵盖多种经典数据结构，包括二叉空间划分树、K-D树以及机器学习中的决策树模型。在高效性方面，我们证明仅适格决策树可被唯一表征为K-置换，而典型的非适格决策树则对应具有更高复杂度的二值标记决策树。基于这一形式化表征，我们开发了一种通用算法框架，用于求解适格决策树在任意分割规则与目标函数下的最优决策树问题。我们通过构造性方法推导出求解此类问题的通用动态规划递推关系。然而，我们证明在空间复杂度层面，记忆化存储通常不可行，因为需要同时存储数据集与子树结构。这一结论与文献中关于数据集与子树存储权衡的主张相悖。本框架进一步兼容树深度与叶节点规模等约束条件，并可借助剪枝等技术实现加速计算。最后，我们将分析扩展至若干非适格决策树，包括广泛研究的二值特征数据决策树、二叉搜索树以及矩阵链乘法问题中衍生的树结构。我们通过适当修改或舍弃特定公理，展示了这些问题的求解路径。