Survival analysis studies techniques for modeling the time until an event of interest occurs in a population. It has found widespread application in healthcare, engineering, and the social sciences. However, the data needed to train survival models are often distributed, incomplete, censored, and confidential. In this context, federated learning can substantially improve the quality of models trained on distributed data while preserving user privacy. Federated survival analysis is still in its early development, however, and there is no common benchmark dataset for testing federated survival models. This work provides a novel technique for constructing realistic heterogeneous datasets from existing non-federated datasets in a reproducible way. Specifically, we propose two dataset-splitting algorithms based on the Dirichlet distribution that assign each data sample to a carefully chosen client: quantity-skewed splitting and label-skewed splitting. These algorithms yield different levels of heterogeneity by changing a single hyperparameter. Finally, numerical experiments provide a quantitative evaluation of the heterogeneity level using log-rank tests, together with a qualitative analysis of the generated splits. The implementation of the proposed methods is publicly available to support reproducibility and to encourage common practices for simulating federated environments for survival analysis.
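To make the two splitting ideas concrete, the following is a minimal sketch of how Dirichlet-based quantity-skewed and label-skewed splits could look. It is not the published implementation: the function names, the use of quantile bins over event times as "labels", and the parameters `alpha` and `n_bins` are assumptions made for illustration only. In both functions a smaller `alpha` concentrates the Dirichlet mass on fewer clients and therefore produces more heterogeneous splits.

```python
import numpy as np


def quantity_skewed_split(n_samples, n_clients, alpha, rng):
    """Return a client index per sample; client sizes follow a Dirichlet draw.

    Hypothetical helper: one symmetric Dirichlet(alpha) draw gives the
    expected share of the dataset held by each client.
    """
    proportions = rng.dirichlet(alpha * np.ones(n_clients))
    return rng.choice(n_clients, size=n_samples, p=proportions)


def label_skewed_split(times, n_clients, alpha, n_bins, rng):
    """Return a client index per sample; each time bin is spread unevenly.

    Assumption of this sketch: the "label" is the observed time, discretized
    into quantile bins. Each bin gets its own Dirichlet(alpha) draw, so
    clients end up with different time distributions.
    """
    times = np.asarray(times)
    # Interior quantile edges, so np.digitize yields bin ids 0 .. n_bins - 1.
    edges = np.quantile(times, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    bins = np.digitize(times, edges)
    assignment = np.empty(len(times), dtype=int)
    for b in range(n_bins):
        idx = np.flatnonzero(bins == b)
        if idx.size == 0:
            continue  # nothing to assign for an empty bin
        proportions = rng.dirichlet(alpha * np.ones(n_clients))
        assignment[idx] = rng.choice(n_clients, size=idx.size, p=proportions)
    return assignment


# Usage on synthetic data (exponential times are a stand-in for a real dataset).
rng = np.random.default_rng(0)
times = rng.exponential(scale=10.0, size=1000)

by_quantity = quantity_skewed_split(len(times), n_clients=5, alpha=0.5, rng=rng)
by_label = label_skewed_split(times, n_clients=5, alpha=0.5, n_bins=10, rng=rng)

print("quantity-skewed client sizes:", np.bincount(by_quantity, minlength=5))
print("label-skewed client sizes:   ", np.bincount(by_label, minlength=5))
```

Passing a seeded `numpy.random.Generator` keeps the splits reproducible, in line with the reproducibility goal stated above. The resulting per-client subsets could then be compared pairwise with a log-rank test from a survival library to quantify how strongly the clients differ.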