In this work, we study the problem of clustering survival data, a challenging and so far under-explored task. We introduce a novel semi-supervised probabilistic approach to cluster survival data by leveraging recent advances in stochastic gradient variational inference. In contrast to previous work, our proposed method employs a deep generative model to uncover the underlying distribution of both the explanatory variables and censored survival times. We compare our model to related work on clustering and mixture models for survival data in comprehensive experiments on a wide range of synthetic, semi-synthetic, and real-world datasets, including medical imaging data. Our method performs better at identifying clusters and is competitive at predicting survival times. Relying on novel generative assumptions, the proposed model offers a holistic perspective on clustering survival data and holds promise for discovering subpopulations whose survival is regulated by different generative mechanisms.