We study the problem of subgroup discovery for survival analysis, where the goal is to find an interpretable subset of the data on which a Cox model is highly accurate. Our work is the first to study this particular subgroup problem, for which we make several contributions. Subgroup discovery methods generally require a "quality function" to sift through candidate subgroups and select the most advantageous ones. We first examine why existing natural choices for quality functions are insufficient to solve the subgroup discovery problem for the Cox model. To address the shortcomings of existing metrics, we introduce two technical innovations: the *expected prediction entropy (EPE)*, a novel metric for evaluating survival models that predict a hazard function; and the *conditional rank statistics (CRS)*, a statistical object that quantifies the deviation of an individual point from the distribution of survival times in an existing subgroup. We study the EPE and CRS theoretically and show that they can solve many of the problems with existing metrics. We introduce a total of eight algorithms for the Cox subgroup discovery problem. The main algorithm takes advantage of both the EPE and the CRS, allowing us to give theoretical correctness results for this algorithm in a well-specified setting. We evaluate all of the proposed methods empirically on both synthetic and real data. The experiments confirm our theory, showing that our contributions allow for the recovery of a ground-truth subgroup in well-specified cases and lead to better model fit than naively fitting the Cox model to the whole dataset in practical settings. Lastly, we conduct a case study on jet engine simulation data from NASA. The discovered subgroups uncover known nonlinearities/homogeneity in the data and suggest design choices that have been mirrored in practice.