Though successful, federated learning (FL) presents new challenges for machine learning, especially when data heterogeneity, also known as the non-IID data problem, arises. To cope with statistical heterogeneity, previous works have incorporated a proximal term into local optimization, modified the model aggregation scheme at the server side, or advocated clustered federated learning, in which the central server groups the agent population into clusters with jointly trainable data distributions to exploit a degree of personalization. While effective, these approaches lack a deep analysis of which kinds of data heterogeneity exist and how each kind impacts the accuracy of the participating clients. In contrast to many prior federated learning approaches, we demonstrate that the data heterogeneity in current setups is not necessarily a problem and can in fact be beneficial for the FL participants. Our observations are intuitive: (1) dissimilar client labels (label skew) do not necessarily constitute data heterogeneity, and (2) the principal angles between the agents' data subspaces, each spanned by the corresponding client's principal data vectors, are a better estimate of data heterogeneity. Our code is available at https://github.com/MMorafah/FL-SC-NIID.
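The heterogeneity estimate in observation (2) can be sketched as follows. This is a minimal illustrative computation, not the paper's implementation: each client's data matrix is reduced to its top-k left singular vectors (its principal vectors), and the principal angles between two clients' subspaces are recovered as the arccosines of the singular values of the product of those bases. The client matrices `A` and `B` here are random placeholders, not real FL data.

```python
import numpy as np

def principal_vectors(X, k):
    # Top-k left singular vectors of a (features x samples) data matrix:
    # an orthonormal basis for the client's k-dimensional data subspace.
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :k]

def principal_angles(U1, U2):
    # For orthonormal bases U1, U2, the singular values of U1^T @ U2
    # are the cosines of the principal angles between the subspaces.
    cosines = np.linalg.svd(U1.T @ U2, compute_uv=False)
    return np.arccos(np.clip(cosines, -1.0, 1.0))

rng = np.random.default_rng(0)
# Two hypothetical clients' data: 20 features, 100 samples each.
A = rng.standard_normal((20, 100))
B = rng.standard_normal((20, 100))

Ua = principal_vectors(A, k=3)
Ub = principal_vectors(B, k=3)
angles = principal_angles(Ua, Ub)  # in radians; all zeros means identical subspaces
print(angles)
```

Smaller angles indicate more similar data subspaces, so two clients with disjoint label sets can still register as statistically similar if their features span nearly the same subspace, which is the sense in which label skew alone need not imply heterogeneity.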