Federated Learning on Non-IID Data Silos: An Experimental Study

Due to growing privacy concerns and data regulations, training data have become increasingly fragmented, forming distributed databases of multiple "data silos" (e.g., within different organizations and countries). To develop effective machine learning services, it is necessary to exploit data from such distributed databases without exchanging the raw data. Federated learning (FL) has recently attracted growing interest as a solution: it enables multiple parties to collaboratively train a machine learning model without exchanging their local data. A key and common challenge on distributed databases is the heterogeneity of the data distribution among the parties: the data of different parties are usually not independent and identically distributed (i.e., non-IID). Many FL algorithms have been proposed to improve learning effectiveness under non-IID data settings. However, an experimental study that systematically examines their advantages and disadvantages is lacking, because previous studies use very rigid data partitioning strategies among parties, which are neither representative nor thorough. In this paper, to help researchers better understand and study the non-IID data setting in federated learning, we propose comprehensive data partitioning strategies that cover the typical non-IID data cases. Moreover, we conduct extensive experiments to evaluate state-of-the-art FL algorithms. We find that non-IID data do bring significant challenges to the learning accuracy of FL algorithms, and that none of the existing state-of-the-art FL algorithms outperforms the others in all cases. Our experiments provide insights for future studies on addressing the challenges of "data silos".
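The abstract does not spell out the partitioning strategies themselves. As a rough illustration of what a non-IID partition can look like, the sketch below simulates label skew by drawing, for each class, the share given to every party from a Dirichlet distribution, which is one commonly used way to do this. It is an assumption for illustration, not necessarily the scheme used in the paper; the function name `dirichlet_label_partition` and the parameters `num_parties`, `beta`, and `seed` are hypothetical.

```python
import numpy as np


def dirichlet_label_partition(labels, num_parties, beta, seed=0):
    """Split sample indices across parties with skewed label distributions.

    Illustrative sketch only: for each class, party shares are drawn from
    Dirichlet(beta). Smaller beta tends to give more skewed partitions.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    party_indices = [[] for _ in range(num_parties)]
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        proportions = rng.dirichlet(np.full(num_parties, beta))
        # Turn the shares into split points within this class's samples.
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for party, chunk in enumerate(np.split(idx, cuts)):
            party_indices[party].extend(chunk.tolist())
    return [np.array(p, dtype=int) for p in party_indices]


if __name__ == "__main__":
    # Toy data: 10 classes with 100 samples each, split across 5 parties.
    toy_labels = np.repeat(np.arange(10), 100)
    parts = dirichlet_label_partition(toy_labels, num_parties=5, beta=0.5)
    for party, idx in enumerate(parts):
        # Per-party class histogram; uneven rows indicate label skew.
        print(party, np.bincount(toy_labels[idx], minlength=10))
```

Each sample is assigned to exactly one party, so the parties' index sets are disjoint and together cover the whole dataset; only the per-party class mix changes with `beta`.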
