FedKL: Tackling Data Heterogeneity in Federated Reinforcement Learning by Penalizing KL Divergence

As a distributed learning paradigm, Federated Learning (FL) faces a communication bottleneck caused by many rounds of model synchronization and aggregation. Heterogeneous data makes the situation worse by slowing convergence. Although the impact of data heterogeneity on supervised FL has been widely studied, the corresponding investigation for Federated Reinforcement Learning (FRL) is still in its infancy. In this paper, we first define the types and levels of data heterogeneity for policy-gradient-based FRL systems. By inspecting the connection between the global and local objective functions, we prove that local training can benefit the global objective if the local update is properly penalized by the total variation (TV) distance between the local and global policies. We also derive a necessary condition for the global policy to be learnable from the local policy, which is directly related to the heterogeneity level. Based on this theoretical result, we propose a Kullback-Leibler (KL) divergence based penalty. Unlike the conventional approach, which penalizes model divergence in the parameter space, this penalty directly constrains the model outputs in the distribution space. By jointly penalizing the divergence of the local policy from the global policy with a global penalty and constraining each iteration of local training with a local penalty, the proposed method achieves a better trade-off between training speed (step size) and convergence. Experimental results on two popular RL experiment platforms demonstrate the advantage of the proposed algorithm over existing methods in accelerating and stabilizing training with heterogeneous data.
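To make the two-penalty idea concrete, the following is a minimal sketch of a KL-penalized surrogate objective for one client with a discrete action space. It is an illustration under stated assumptions, not the paper's actual objective: the abstract does not give the exact loss, so the importance-ratio surrogate, the direction of each KL term, and the names `penalized_surrogate`, `beta_global`, and `beta_local` are all hypothetical choices made here.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def kl_divergence(p, q, eps=1e-12):
    """Mean KL(p || q) over a batch of categorical distributions."""
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))


def penalized_surrogate(local_logits, prev_logits, global_logits,
                        actions, advantages,
                        beta_global=1.0, beta_local=1.0):
    """Surrogate objective (to be maximized) for one local update.

    Assumed form (hypothetical, not taken from the paper):
      mean(ratio * advantage)
        - beta_global * KL(global policy || local policy)
        - beta_local  * KL(previous local policy || local policy)
    """
    pi_local = softmax(local_logits)
    pi_prev = softmax(prev_logits)
    pi_global = softmax(global_logits)

    idx = np.arange(len(actions))
    # Importance ratio between the updated and previous local policies.
    ratio = pi_local[idx, actions] / pi_prev[idx, actions]
    surrogate = float(np.mean(ratio * advantages))

    # Global penalty: keeps the local policy close to the global policy
    # in distribution space rather than in parameter space.
    global_penalty = kl_divergence(pi_global, pi_local)
    # Local penalty: limits how far a single local iteration can move.
    local_penalty = kl_divergence(pi_prev, pi_local)

    return surrogate - beta_global * global_penalty - beta_local * local_penalty
```

Both penalties act on the policies' output distributions, which is the contrast the abstract draws with parameter-space penalties. In this sketch, larger coefficients trade step size for stability, and setting both to zero recovers the unpenalized surrogate.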
