Non-parametric maximum likelihood estimation encompasses a group of classic methods for estimating distribution-associated functions from potentially censored and truncated data, with extensive applications in survival analysis. These methods, including the Kaplan-Meier estimator and Turnbull's method, often overfit, especially when the sample size is small. We propose an improvement to these methods that applies kernel smoothing to their raw estimates, with the smoothing chosen by a BIC-type loss function that balances model fit against model complexity. In the context of a longitudinal study with repeated observations, we detail our proposed smoothing procedure and optimization algorithm. In extensive simulation studies over multiple realistic scenarios, we demonstrate that our smoothing-based procedure reduces overfitting and thereby provides better overall accuracy in both survival function estimation and individual-level time-to-event prediction. With interval-censored data, our smoothing procedure decreases the discrepancy between the estimated and true simulated survival function by up to 49% compared to the raw unsmoothed estimate, with similar improvements of up to 41% and 23% in within-sample and out-of-sample prediction, respectively. Finally, we apply our method to real data on censored breast cancer diagnosis, which similarly shows improvement when compared to empirical survival estimates from uncensored data. We provide an R package, SISE, for implementing our penalized likelihood method.
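To make the idea of smoothing a raw non-parametric estimate concrete, the following is a minimal Python sketch, not the SISE implementation and not the procedure of the paper. It assumes simple right-censored data (the paper also treats interval-censored data), computes the Kaplan-Meier step function, and then replaces each jump with a Gaussian-CDF ramp. The function names `kaplan_meier` and `smooth_survival`, the toy data, and the fixed bandwidth of 1.0 are all illustrative assumptions; in particular, the BIC-type bandwidth selection described above is not reproduced here.

```python
import math


def kaplan_meier(times, events):
    """Kaplan-Meier estimate for right-censored data.

    times  : observed times (event or censoring time)
    events : 1 if the event was observed, 0 if right-censored
    Returns (event_times, survival), the value of S(t) just after each event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    event_times, survival = [], []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all observations tied at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / at_risk
            event_times.append(t)
            survival.append(surv)
        at_risk -= removed
    return event_times, survival


def smooth_survival(event_times, survival, grid, bandwidth):
    """Gaussian-kernel smoothing of a survival step function (illustrative).

    Each jump of size d_j at time t_j is replaced by d_j * Phi((t - t_j) / h),
    so S_smooth(t) = 1 - sum_j d_j * Phi((t - t_j) / h).
    """
    jumps = []
    prev = 1.0
    for s in survival:
        jumps.append(prev - s)
        prev = s

    def normal_cdf(z):
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

    return [
        1.0 - sum(d * normal_cdf((t - tj) / bandwidth)
                  for tj, d in zip(event_times, jumps))
        for t in grid
    ]


# Toy right-censored sample (hypothetical values, for illustration only).
times = [2, 3, 3, 5, 7, 8, 11]
events = [1, 1, 0, 1, 0, 1, 1]

km_t, km_s = kaplan_meier(times, events)
grid = [0.5 * k for k in range(0, 29)]  # evaluation points 0.0, 0.5, ..., 14.0
smoothed = smooth_survival(km_t, km_s, grid, bandwidth=1.0)
```

A larger bandwidth gives a smoother curve that tracks the raw steps less closely, while a smaller one approaches the original step function; a data-driven criterion such as the BIC-type loss mentioned above would be needed to choose between them.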