State-of-the-art federated learning methods can perform far worse than their centralized counterparts when clients have dissimilar data distributions. For neural networks, even when centralized SGD easily finds a solution that is simultaneously performant for all clients, current federated optimization methods fail to converge to a comparable solution. We show that this performance disparity can largely be attributed to optimization challenges presented by nonconvexity. Specifically, we find that the early layers of the network do learn useful features, but the final layers fail to make use of them. That is, federated optimization applied to this non-convex problem distorts the learning of the final layers. Leveraging this observation, we propose a Train-Convexify-Train (TCT) procedure to sidestep this issue: first, learn features using off-the-shelf methods (e.g., FedAvg); then, optimize a convexified problem obtained from the network's empirical neural tangent kernel approximation. Our technique yields accuracy improvements of up to +36% on FMNIST and +37% on CIFAR10 when clients have dissimilar data.
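To make the Convexify step concrete, below is a minimal sketch (not the authors' code) of how a network pretrained in a first stage (e.g., by FedAvg) can be replaced by its empirical neural tangent kernel (eNTK) representation and fit as a convex problem. The tiny MLP, the synthetic data, and the closed-form ridge solve are illustrative assumptions; in TCT the first-stage weights come from federated training and the resulting convex problem is itself solved federatedly rather than in closed form.

```python
# Sketch of the Convexify step: map each example to its eNTK feature
# (the gradient of the network output with respect to the parameters)
# and fit a linear model on those features.
import jax
import jax.numpy as jnp

def init_params(key, d_in=10, d_hidden=32):
    # Stand-in for weights obtained from a first training stage (e.g., FedAvg).
    k1, k2 = jax.random.split(key)
    return {
        "W1": jax.random.normal(k1, (d_in, d_hidden)) / jnp.sqrt(d_in),
        "b1": jnp.zeros(d_hidden),
        "w2": jax.random.normal(k2, (d_hidden,)) / jnp.sqrt(d_hidden),
    }

def mlp(params, x):
    # Scalar output, so the eNTK feature of x is a single flattened gradient.
    h = jax.nn.relu(x @ params["W1"] + params["b1"])
    return h @ params["w2"]

def entk_features(params, X):
    # phi(x) = d f(x; params) / d params, flattened into one vector per example.
    grads = jax.vmap(lambda x: jax.grad(mlp)(params, x))(X)
    leaves = jax.tree_util.tree_leaves(grads)
    return jnp.concatenate([g.reshape(X.shape[0], -1) for g in leaves], axis=1)

key = jax.random.PRNGKey(0)
params = init_params(key)           # pretrained weights would go here
X = jax.random.normal(key, (64, 10))
y = jnp.sin(X[:, 0])                # toy regression targets

Phi = entk_features(params, X)      # convexified (linear) representation
lam = 1e-3                          # ridge regularization strength (assumed)
w = jnp.linalg.solve(Phi.T @ Phi + lam * jnp.eye(Phi.shape[1]), Phi.T @ y)
preds = Phi @ w
print(jnp.mean((preds - y) ** 2))   # training error of the convexified model
```

Because the second-stage objective is linear in the eNTK features, it is convex, which is what lets federated optimization avoid the final-layer distortion described above.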