We study practical data characteristics underlying federated learning, where the non-i.i.d. data of clients have sparse features, and a client's local data typically involve only a small part of the full model, called a submodel. Due to data sparsity, the classical federated averaging (FedAvg) algorithm and its variants are severely slowed down: when the global model is updated, each client's zero update over the part of the full model outside its submodel is inaccurately aggregated. We therefore propose federated submodel averaging (FedSubAvg), which ensures that the expectation of the global update of each model parameter equals the average of the local updates of the clients whose submodels involve it. We theoretically prove the convergence rate of FedSubAvg by deriving an upper bound under a new metric called the element-wise gradient norm. In particular, this new metric can characterize the convergence of federated optimization over sparse data, whereas the conventional squared gradient norm used for FedAvg and its variants cannot. We extensively evaluated FedSubAvg on both public and industrial datasets. The results demonstrate that FedSubAvg significantly outperforms FedAvg and its variants.
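To make the aggregation rule concrete, below is a minimal sketch of per-coordinate averaging in which each parameter is updated only by the clients whose submodels involve it. It is written in Python with NumPy; the function and variable names are hypothetical illustrations rather than the paper's implementation, and the sketch omits the paper's handling of client sampling, for which FedSubAvg matches the average only in expectation.

```python
import numpy as np

def fed_sub_avg_round(global_model, client_updates, client_masks):
    """One illustrative aggregation round (hypothetical sketch).

    global_model:   1-D array of model parameters.
    client_updates: list of dense update vectors, zero outside each
                    client's submodel.
    client_masks:   list of 0/1 vectors marking which coordinates
                    belong to each client's submodel.
    """
    update_sum = np.zeros_like(global_model)
    count = np.zeros_like(global_model)
    for upd, mask in zip(client_updates, client_masks):
        update_sum += upd * mask
        count += mask
    # Average each coordinate only over the clients that involve it,
    # so uninvolved clients' zero updates do not dilute the average;
    # coordinates touched by no client are left unchanged.
    involved = count > 0
    global_model[involved] += update_sum[involved] / count[involved]
    return global_model
```

Under this rule, a parameter shared by only two of ten clients is divided by two rather than ten, which is exactly the distinction from plain FedAvg that the abstract attributes the speedup to.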