Random survival forests and survival trees are popular models in statistics and machine learning. However, their consistency, their splitting rules, and the influence of the censoring mechanism are not well understood in general. In this paper, we investigate the statistical properties of existing methods from several perspectives. First, we show that traditional splitting rules for censored outcomes rely on a biased estimate of the within-node failure distribution. To quantify this bias exactly, we develop a concentration bound for the within-node estimate based on non-i.i.d. samples and apply it to the entire forest. Second, we analyze the entanglement between the failure and censoring distributions caused by univariate splits, and show that, even without correcting the bias at an internal node, survival tree and forest models can still be consistent under suitable conditions. In particular, we demonstrate this property in two cases: a finite-dimensional case where the splitting variables and cutting points are chosen randomly, and a high-dimensional case where the covariates are weakly correlated. Our results also reduce to the independent-covariate setting, which is commonly used in the random forest literature for high-dimensional sparse models. However, it may be unavoidable that the convergence rate depends on the total number of variables involved in the failure and censoring distributions. Third, we propose a new splitting rule that compares bias-corrected cumulative hazard functions at each internal node. We show that the convergence rate of this new model depends only on the number of failure variables, which improves on the non-bias-corrected versions. Simulation studies confirm that this can substantially reduce the prediction error.
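As an illustration of the within-node quantities discussed above, the following is a minimal sketch, assuming the standard Nelson-Aalen estimator as the uncorrected within-node cumulative hazard. The function names `nelson_aalen` and `split_score`, the time grid, and the weighted mean absolute difference used as a score are hypothetical choices made for illustration. They are not the splitting rule proposed in the paper, and the bias correction itself is not reproduced here because the abstract does not specify it.

```python
import numpy as np


def nelson_aalen(time, event, grid):
    """Uncorrected Nelson-Aalen cumulative hazard of one node, evaluated on `grid`."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    fail_times = np.unique(time[event])  # sorted distinct failure times
    # d[k]: failures at fail_times[k]; n[k]: observations still at risk there
    d = np.array([(event & (time == t)).sum() for t in fail_times], dtype=float)
    n = np.array([(time >= t).sum() for t in fail_times], dtype=float)
    cum = np.concatenate([[0.0], np.cumsum(d / n)])
    # Step function: value at g is the sum of increments with fail_time <= g
    idx = np.searchsorted(fail_times, np.asarray(grid, dtype=float), side="right")
    return cum[idx]


def split_score(x, time, event, cut, grid):
    """Hypothetical score: weighted mean absolute gap between child hazards."""
    x = np.asarray(x, dtype=float)
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    left = x <= cut
    n_left, n_right = int(left.sum()), int((~left).sum())
    if n_left == 0 or n_right == 0:
        return 0.0  # degenerate split: one child is empty
    h_left = nelson_aalen(time[left], event[left], grid)
    h_right = nelson_aalen(time[~left], event[~left], grid)
    weight = n_left * n_right / len(x) ** 2
    return float(weight * np.abs(h_left - h_right).mean())
```

In a tree-growing loop, a score of this kind would be evaluated over candidate variables and cut points, and the largest value would determine the split. Because censored observations enter only through the risk sets, the child estimates mix the failure and censoring distributions, which is the entanglement the analysis addresses.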