As a distributed learning paradigm, Federated Learning (FL) faces a communication bottleneck due to the many rounds of model synchronization and aggregation it requires. Heterogeneous data further exacerbates the situation by slowing convergence. Although the impact of data heterogeneity on supervised FL has been widely studied, the corresponding investigation for Federated Reinforcement Learning (FRL) is still in its infancy. In this paper, we first define the type and level of data heterogeneity for policy-gradient-based FRL systems. By inspecting the connection between the global and local objective functions, we prove that local training can benefit the global objective if the local update is properly penalized by the total variation (TV) distance between the local and global policies. A necessary condition for the global policy to be learnable from the local policies is also derived, which is directly related to the heterogeneity level. Based on this theoretical result, a Kullback-Leibler (KL) divergence based penalty is proposed, which, unlike conventional methods that penalize model divergence in the parameter space, directly constrains the model outputs in the distribution space. A convergence proof of the proposed algorithm is also provided. By jointly penalizing the divergence of the local policy from the global policy with a global penalty and constraining each iteration of the local training with a local penalty, the proposed method achieves a better trade-off between training speed (step size) and convergence. Experimental results on two popular Reinforcement Learning (RL) platforms demonstrate the advantage of the proposed algorithm over existing methods in accelerating and stabilizing the training process with heterogeneous data.
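To make the described penalty structure concrete, the following is a minimal sketch of the penalized local objective for agent $k$, assuming hypothetical penalty weights $\lambda_g$ (global) and $\lambda_l$ (local) and an illustrative KL direction; the exact formulation and notation are given in the body of the paper:

\[
\max_{\theta}\; J_k(\theta)\;-\;\lambda_g\, D_{\mathrm{KL}}\!\left(\pi_{\theta}\,\middle\|\,\pi_{\theta_g}\right)\;-\;\lambda_l\, D_{\mathrm{KL}}\!\left(\pi_{\theta}\,\middle\|\,\pi_{\theta^{(t-1)}}\right),
\]

where $J_k$ denotes the local objective, $\pi_{\theta_g}$ the latest global policy (global penalty), and $\pi_{\theta^{(t-1)}}$ the local policy from the previous local iteration (local penalty).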