Machine learning services are emerging in many data-intensive applications, and their effectiveness relies heavily on large volumes of high-quality training data. However, due to growing privacy concerns and data regulations, training data have become increasingly fragmented, forming distributed databases composed of multiple data silos (e.g., within different organizations and countries). To develop effective machine learning services, it is necessary to exploit data from such distributed databases without exchanging the raw data. Federated learning (FL) has recently attracted growing interest as a solution: it enables multiple parties to collaboratively train a machine learning model without exchanging their local data. A key and common challenge in distributed databases is the heterogeneity of data distributions among the parties, i.e., the data are not independent and identically distributed (non-IID). Many FL algorithms have been proposed to improve learning effectiveness under non-IID data settings. However, an experimental study that systematically examines their advantages and disadvantages is lacking, because previous studies use very rigid data partitioning strategies among parties, which are neither representative nor thorough. In this paper, to help researchers better understand and study the non-IID data setting in federated learning, we propose comprehensive data partitioning strategies that cover the typical non-IID data cases. Moreover, we conduct extensive experiments to evaluate state-of-the-art FL algorithms. We find that non-IID data do bring significant challenges to the learning accuracy of FL algorithms, and that none of the existing state-of-the-art FL algorithms outperforms the others in all cases. Our experiments provide insights for future studies on addressing the challenges of data silos.
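The abstract does not spell out the partitioning strategies themselves. As a purely illustrative sketch, the snippet below shows one way a label-skewed (non-IID) split across parties could be simulated: for each class, party proportions are drawn from a Dirichlet distribution and that class's samples are divided accordingly. The function name `dirichlet_label_partition` and the parameters `num_parties`, `beta`, and `seed` are hypothetical choices for this example, and Dirichlet-based label skew is an assumption here, not something the abstract states.

```python
import random
from collections import defaultdict


def dirichlet_label_partition(labels, num_parties, beta=0.5, seed=0):
    """Split sample indices across parties with label distribution skew.

    Hypothetical helper for illustration. For each class, party proportions
    are drawn from a symmetric Dirichlet(beta); smaller beta gives a more
    skewed (more strongly non-IID) split.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    parties = [[] for _ in range(num_parties)]
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)

        # A Dirichlet sample is a vector of Gamma draws normalised to sum to 1.
        gammas = [rng.gammavariate(beta, 1.0) for _ in range(num_parties)]
        total = sum(gammas)
        if total == 0.0:
            # Extremely small beta can underflow; fall back to a uniform split.
            props = [1.0 / num_parties] * num_parties
        else:
            props = [g / total for g in gammas]

        # Convert cumulative proportions into cut points over this class.
        start = 0
        cumulative = 0.0
        for p in range(num_parties):
            cumulative += props[p]
            if p == num_parties - 1:
                end = len(idxs)  # the last party takes whatever remains
            else:
                end = min(len(idxs), max(start, round(cumulative * len(idxs))))
            parties[p].extend(idxs[start:end])
            start = end
    return parties


if __name__ == "__main__":
    # Toy labels: 1000 samples spread evenly over 10 classes.
    toy_labels = [i % 10 for i in range(1000)]
    split = dirichlet_label_partition(toy_labels, num_parties=5, beta=0.5)
    for party_id, indices in enumerate(split):
        print(party_id, len(indices))
```

Every sample is assigned to exactly one party, so the parties' index lists together form a partition of the dataset. Lowering `beta` concentrates each class on fewer parties, which is one way to vary how non-IID the simulated silos are.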