Federated learning enables machine learning models to be trained on multiple decentralized local datasets without explicit data exchange. However, data pre-processing, including strategies for handling missing data, remains a major bottleneck in real-world federated learning deployments and is typically performed locally. Local pre-processing may be biased, since the subpopulation observed at each center may not be representative of the overall population. To address this issue, this paper first proposes a more consistent approach to data standardization through a federated model. We then propose Fed-MIWAE, a federated version of the state-of-the-art imputation method MIWAE, a deep latent variable model for missing data imputation based on variational autoencoders. MIWAE can be trained easily with classical federated aggregators. Furthermore, it can handle MAR (Missing At Random) data, in which the missingness of a variable can depend on the observed variables, a more challenging missing-data mechanism than MCAR (Missing Completely At Random). We evaluate our method on multi-modal medical imaging data and clinical scores in a federated scenario simulated with the ADNI dataset. We compare Fed-MIWAE against classical imputation methods, performed either locally or in a centralized fashion. Fed-MIWAE achieves imputation accuracy comparable to that of the best centralized method, even when local data distributions are highly heterogeneous. In addition, thanks to its variational nature, Fed-MIWAE is designed to perform multiple imputation, which allows the imputation uncertainty to be quantified in the federated scenario.
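To make the standardization issue concrete, the following is a minimal sketch of one simple way to obtain global per-feature means and standard deviations without sharing raw data: each center sends only counts, sums, and sums of squares over its observed entries. This is an illustration under stated assumptions, not the federated model proposed in the paper. The function names (`local_stats`, `aggregate`, `standardize`), the use of `None` for missing entries, and the synthetic three-center data are all hypothetical.

```python
import math
import random


def local_stats(rows):
    """Per-feature (count, sum, sum of squares) over observed entries.

    Assumption: a missing value is encoded as None.
    """
    d = len(rows[0])
    n, s, ss = [0] * d, [0.0] * d, [0.0] * d
    for row in rows:
        for j, v in enumerate(row):
            if v is not None:
                n[j] += 1
                s[j] += v
                ss[j] += v * v
    return n, s, ss


def aggregate(stats):
    """Sum the client statistics, then derive the global per-feature
    mean and population standard deviation."""
    d = len(stats[0][0])
    means, stds = [], []
    for j in range(d):
        n = sum(st[0][j] for st in stats)
        s = sum(st[1][j] for st in stats)
        ss = sum(st[2][j] for st in stats)
        mean = s / n
        var = max(ss / n - mean * mean, 0.0)  # clamp tiny negative rounding errors
        means.append(mean)
        stds.append(math.sqrt(var))
    return means, stds


def standardize(rows, means, stds):
    """Scale observed entries with the global statistics; keep None as is."""
    return [
        [None if v is None else (v - m) / (sd or 1.0)
         for v, m, sd in zip(row, means, stds)]
        for row in rows
    ]


# Three hypothetical centers with shifted (heterogeneous) distributions
# and roughly 20% of entries missing completely at random.
random.seed(0)
clients = []
for mu in (0.0, 2.0, 5.0):
    rows = [
        [random.gauss(mu, 1.0) if random.random() > 0.2 else None
         for _ in range(3)]
        for _ in range(50)
    ]
    clients.append(rows)

global_means, global_stds = aggregate([local_stats(c) for c in clients])
scaled = [standardize(c, global_means, global_stds) for c in clients]
```

Because the three quantities are additive across centers, the aggregated mean and standard deviation match what would be computed on the pooled observed data, whereas standardizing each center with its own statistics would erase the shift between centers.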