A core step of every algorithm for learning regression trees is the selection of the best splitting variable from the available covariates, along with the corresponding split point. Early tree algorithms (e.g., AID, CART) employed greedy search strategies, directly comparing all possible split points across all available covariates. However, subsequent research showed that this approach is biased towards selecting covariates with more potential split points. Therefore, unbiased recursive partitioning algorithms have been suggested (e.g., QUEST, GUIDE, CTree, MOB) that first select the covariate based on statistical inference, using p-values that are adjusted for the number of possible split points. In a second step, a split point optimizing some objective function is selected within the chosen split variable. However, different unbiased tree algorithms obtain these p-values from different inference frameworks, and their relative advantages and disadvantages are not yet well understood. Therefore, three popular approaches are considered here: classical categorical association tests (as in GUIDE), conditional inference (as in CTree), and parameter instability tests (as in MOB). First, these are embedded into a common inference framework encompassing parametric model trees, in particular linear model trees. Second, it is assessed how different building blocks from this common framework affect the power of the algorithms to select the appropriate covariates for splitting: the observation-wise goodness-of-fit measure (residuals vs. model scores), dichotomization of residuals/scores at zero, and binning of possible split variables. This shows that the goodness-of-fit measure in particular is crucial for the power of the procedures, with model scores without dichotomization performing much better in many scenarios.
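To make the two-step selection concrete, the following is a minimal sketch (not the authors' reference implementation) of a GUIDE-style variable selection step for a linear model tree: residuals of the node model are dichotomized at zero, each numeric split candidate is binned at its quartiles, and a classical chi-square test of association yields one p-value per covariate, comparable across covariates regardless of how many distinct split points each offers. All names here (select_split_variable, n_bins, and the argument layout) are illustrative assumptions, not from the paper.

```python
import numpy as np
from scipy.stats import chi2_contingency


def select_split_variable(X_model, y, Z_split, n_bins=4):
    """Return (index, p-values) of the best split variable among Z_split.

    X_model : (n, p) regressor matrix of the linear model fit in the node
    y       : (n,) response
    Z_split : (n, q) candidate split covariates (assumed numeric here)
    """
    # Step 1a: fit the node model and compute observation-wise residuals.
    beta, *_ = np.linalg.lstsq(X_model, y, rcond=None)
    resid = y - X_model @ beta
    # Dichotomize residuals at zero (assumes both signs occur in the node).
    sign = (resid > 0).astype(int)

    pvals = []
    for j in range(Z_split.shape[1]):
        z = Z_split[:, j]
        # Bin the numeric covariate at its empirical quartiles.
        edges = np.quantile(z, np.linspace(0, 1, n_bins + 1)[1:-1])
        bins = np.digitize(z, edges)
        # Cross-tabulate residual sign against bins and test for association;
        # the p-value adjusts implicitly for the covariate's resolution.
        table = np.zeros((2, n_bins))
        for s, b in zip(sign, bins):
            table[s, b] += 1
        table = table[:, table.sum(axis=0) > 0]  # drop empty bins
        _, p, *_ = chi2_contingency(table)
        pvals.append(p)

    pvals = np.array(pvals)
    return int(np.argmin(pvals)), pvals
```

In a second step, the split point search would be restricted to the selected covariate. Replacing the dichotomized residuals above with untransformed observation-wise model scores, and dropping the binning, corresponds to the CTree/MOB-style building blocks whose higher power the abstract reports.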