Federated learning (FL) has been proposed as a method to train a model across different units without exchanging data. This offers great opportunities in the healthcare sector, where large datasets are available but cannot be shared in order to preserve patient privacy. We systematically investigate the effectiveness of FL on the publicly available eICU dataset for predicting the survival of each ICU stay. We employ Federated Averaging as the main practical algorithm for FL and show how its performance changes when altering three key hyper-parameters, taking into account that clients can vary significantly in size. We find that in many settings, a large number of local training epochs improves performance while simultaneously reducing communication costs. Furthermore, we outline the settings in which it is possible to have only a small number of hospitals participating in each federated update round. When many hospitals with low patient counts are involved, overfitting can be mitigated by decreasing the batch size. This study thus contributes toward identifying suitable settings for running distributed algorithms such as FL on clinical datasets.
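The server-side aggregation step of Federated Averaging weights each client's locally trained parameters by its number of local samples, which matters here because hospitals (clients) vary significantly in size. A minimal sketch of that aggregation step is shown below; the function name and data layout are illustrative, not from the study's implementation:

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """One Federated Averaging aggregation round.

    client_weights: list of per-client parameter lists (one np.ndarray
                    per layer), produced by local training.
    client_sizes:   number of local samples at each client, used as the
                    averaging weight so larger hospitals contribute more.
    """
    total = sum(client_sizes)
    n_layers = len(client_weights[0])
    # Weighted sum of each layer's parameters across clients.
    return [
        sum((n / total) * w[i] for w, n in zip(client_weights, client_sizes))
        for i in range(n_layers)
    ]
```

In a full FL round, each client would first run several local training epochs (one of the hyper-parameters varied in the study) before its parameters enter this weighted average.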