Federated Learning with Imbalanced and Agglomerated Data Distribution for Medical Image Classification

Federated learning (FL), which trains deep models on decentralized data without leaking private information, has drawn great attention recently. Two common issues, data heterogeneity from the local perspective and class imbalance from the global perspective, limit FL's performance. These two coupled problems are under-explored, and the few existing studies may not model data distributions realistically enough for practical scenarios (e.g., medical scenarios). One common observation is that the overall class distribution across clients is imbalanced (e.g., common vs. rare diseases) and that data tend to agglomerate at the more advanced clients (i.e., the data agglomeration effect), which existing settings cannot model. Inspired by real medical imaging datasets, we identify and formulate a new and more realistic data distribution, denoted the L2 distribution, in which the global class distribution is highly imbalanced and the data distributions across clients are imbalanced while exhibiting a certain degree of data agglomeration. To pursue effective FL under this distribution, we propose a novel privacy-preserving framework named FedIIC that calibrates deep models to alleviate the bias caused by imbalanced training. To calibrate the feature extractor, intra-client contrastive learning with a modified similarity measure and inter-client contrastive learning guided by shared global prototypes are introduced to produce a uniform embedding distribution of all classes across clients. To calibrate the classification heads, a softmax cross-entropy loss with difficulty-aware logit adjustment is constructed to ensure balanced decision boundaries for all classes. Experimental results on publicly available datasets demonstrate the superior performance of FedIIC under both the proposed realistic modeling and the existing modeling of the two coupled problems.
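
As a rough illustration of the classification-head calibration mentioned above, the sketch below implements plain class-prior logit adjustment for a softmax cross-entropy loss. It is not FedIIC's loss: the difficulty-aware term is not specified in this abstract and is not reproduced here. The function name `logit_adjusted_cross_entropy`, the temperature `tau`, and the example class counts are assumptions made for illustration only.

```python
import math


def logit_adjusted_cross_entropy(logits, label, class_counts, tau=1.0):
    """Softmax cross entropy on logits shifted by tau * log(class prior).

    Hypothetical helper: generic prior-based logit adjustment, not the
    difficulty-aware variant named in the abstract.
    """
    total = sum(class_counts)
    # Shift each logit by the log-prior of its class. Rare classes receive a
    # more negative shift, so the model must give them a larger raw logit
    # to reach the same training loss.
    adjusted = [z + tau * math.log(n / total)
                for z, n in zip(logits, class_counts)]
    # Numerically stable log-sum-exp over the adjusted logits.
    m = max(adjusted)
    log_norm = m + math.log(sum(math.exp(a - m) for a in adjusted))
    return log_norm - adjusted[label]


if __name__ == "__main__":
    counts = [90, 10]    # assumed global counts: one common, one rare class
    logits = [0.0, 0.0]  # the model is undecided between the two classes
    print(logit_adjusted_cross_entropy(logits, 0, counts))  # common class
    print(logit_adjusted_cross_entropy(logits, 1, counts))  # rare class
```

With equal raw logits, the adjusted loss is larger for the rare class than for the common one, whereas setting `tau=0` recovers ordinary cross entropy and treats both classes identically. This is the general mechanism by which logit adjustment pushes decision boundaries away from minority classes.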
