Many clinical studies require following patients over time. This is challenging: apart from frequent drop-out, organizational and financial constraints can reduce data collection and, in turn, complicate subsequent analyses. In contrast, plenty of baseline data is often available for patients with similar characteristics and background information, e.g., patients who fall outside the study time window. In this article, we investigate whether including such unlabeled data instances can help predict survival times more accurately. In other words, we introduce a third level of supervision in the context of survival analysis: in addition to fully observed and censored instances, we also include unlabeled instances. We propose three approaches to deal with this novel setting and provide an empirical comparison over fifteen real-life clinical and gene expression survival datasets. Our results demonstrate that all approaches are able to increase predictive performance on independent test data. We also show that integrating the partial supervision provided by censored data in a semi-supervised wrapper approach generally provides the best results, often achieving substantial improvements compared to not using unlabeled data.