Due to increasing privacy concerns and data regulations, training data have become increasingly fragmented, forming distributed databases of multiple ``data silos'' (e.g., within different organizations and countries). To develop effective machine learning services, it is necessary to exploit data from such distributed databases without exchanging the raw data. Recently, federated learning (FL) has attracted growing interest as a solution, as it enables multiple parties to collaboratively train a machine learning model without exchanging their local data. A key and common challenge in distributed databases is the heterogeneity of the data distribution among the parties (i.e., non-IID data). Many FL algorithms have been proposed to improve learning effectiveness under non-IID data settings. However, an experimental study that systematically examines their advantages and disadvantages is lacking, as previous studies use very rigid data partitioning strategies among parties, which are hardly representative or thorough. In this paper, to help researchers better understand and study the non-IID data setting in federated learning, we propose comprehensive data partitioning strategies that cover the typical non-IID data cases. Moreover, we conduct extensive experiments to evaluate state-of-the-art FL algorithms. We find that non-IID data does bring significant challenges to the learning accuracy of FL algorithms, and that none of the existing state-of-the-art FL algorithms outperforms the others in all cases. Our experiments provide insights for future studies on addressing the challenges of ``data silos''.
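To make the notion of a non-IID partition concrete, the following is a minimal sketch of one way to simulate label distribution skew among parties. It is an illustration only: the abstract does not specify the partitioning strategies, and the Dirichlet-based allocation, the function name `dirichlet_label_partition`, and the parameters `num_parties` and `beta` are assumptions introduced here. Each class's samples are divided among parties according to proportions drawn from a Dirichlet distribution, so that a smaller `beta` tends to yield a more skewed partition.

```python
import numpy as np


def dirichlet_label_partition(labels, num_parties, beta, seed=0):
    """Assign sample indices to parties with label-distribution skew.

    Hypothetical helper for illustration; not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    party_indices = [[] for _ in range(num_parties)]

    for cls in np.unique(labels):
        # Indices of all samples belonging to this class, in random order.
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)

        # Share of this class that each party receives.
        proportions = rng.dirichlet(np.full(num_parties, beta))

        # Turn cumulative shares into split points within idx.
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for party, chunk in enumerate(np.split(idx, cuts)):
            party_indices[party].extend(chunk.tolist())

    return party_indices


# Example: 10 classes with 100 samples each, split among 5 parties.
labels = np.repeat(np.arange(10), 100)
parts = dirichlet_label_partition(labels, num_parties=5, beta=0.5)
```

Every sample is assigned to exactly one party, so the raw data are only partitioned, never duplicated; some parties may receive few or no samples of a given class when `beta` is small.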