Tree-based methods for length-biased survival data

Left-truncated survival data commonly arise in prevalent cohort studies, where only individuals who have experienced disease onset and survived until enrollment are included in the study. When the onset process follows a stationary Poisson process, the resulting data are length-biased. This sampling mechanism induces a selection bias towards individuals with longer survival times, so statistical methods for traditional survival data are not directly applicable. Tree-based methods developed for left-truncated data can be applied, but they may be inefficient for length-biased data because they do not account for the distribution of the truncation times. To address this, we propose new survival trees and forests for length-biased right-censored data within the conditional inference framework. Our approach uses a score function derived from the full likelihood to construct permutation test statistics for variable splitting. For survival prediction, we consider two estimators of the unbiased survival function that differ in statistical efficiency and computational complexity. These elements enhance efficiency in tree construction and improve the accuracy of survival prediction in ensemble settings. Simulation studies demonstrate efficiency gains in both tree recovery and survival prediction, often exceeding the gains from ensembling alone. We further illustrate the utility of the proposed methods using lung cancer data from the Cancer Public Library Database, a nationwide cancer registry in South Korea.

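
The selection bias described in the abstract can be illustrated with a small simulation. The sketch below is not the proposed method: it assumes exponential survival, onset times uniform over a long look-back window (a stand-in for a stationary Poisson onset process), and no right censoring. The function names `sample_prevalent_cohort`, `naive_survival`, and `weighted_survival` are hypothetical. Under these assumptions, the sampled survival times have density proportional to t f(t), so the naive empirical survival curve is biased upward, while weighting each observation by the inverse of its length targets the unbiased survival function. For a unit-rate exponential, the unbiased survival probability at t = 1 is exp(-1), about 0.368, and the length-biased one is 2 exp(-1), about 0.736.

```python
import random


def sample_prevalent_cohort(n, rate, seed=0):
    """Simulate n survival times observed through prevalent-cohort sampling.

    Assumptions (illustrative only): exponential survival with the given rate,
    onset times uniform over a long look-back window before enrollment at
    calendar time 0, and no right censoring.
    """
    rng = random.Random(seed)
    window = 20.0 / rate  # look-back window of 20 mean lifetimes
    kept = []
    while len(kept) < n:
        onset = rng.uniform(-window, 0.0)
        lifetime = rng.expovariate(rate)
        if onset + lifetime > 0.0:  # still alive at enrollment, so sampled
            kept.append(lifetime)
    return kept


def naive_survival(times, t):
    """Empirical survival probability at t, ignoring the sampling bias."""
    return sum(1 for x in times if x > t) / len(times)


def weighted_survival(times, t):
    """Inverse-length weighted estimate of the unbiased survival probability.

    Valid for uncensored length-biased data: each observation is weighted
    by 1/x to undo the sampling probability proportional to x.
    """
    total = sum(1.0 / x for x in times)
    return sum(1.0 / x for x in times if x > t) / total


times = sample_prevalent_cohort(20000, rate=1.0, seed=1)
print("naive S(1):   ", round(naive_survival(times, 1.0), 3))
print("weighted S(1):", round(weighted_survival(times, 1.0), 3))
```

With right censoring, as in the setting of the abstract, this simple weighting no longer applies directly, which is one reason more elaborate estimators of the unbiased survival function are needed.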