Since reinforcement learning algorithms are notoriously data-intensive, the task of sampling observations from the environment is usually split across multiple agents. However, transferring these observations from the agents to a central location can be prohibitively expensive in terms of communication cost, and it can also compromise the privacy of each agent's local behavior policy. In this paper, we consider a federated reinforcement learning framework where multiple agents collaboratively learn a global model without sharing their individual data and policies. Each agent maintains a local copy of the model and updates it using locally sampled data. Although having N agents enables the sampling of N times more data, it is not clear whether this leads to a proportional speedup in convergence. We propose federated versions of on-policy TD, off-policy TD, and Q-learning, and analyze their convergence. For all these algorithms, to the best of our knowledge, we are the first to consider Markovian noise and multiple local updates, and to prove a linear convergence speedup with respect to the number of agents. To obtain these results, we show that federated TD and Q-learning are special cases of a general framework for federated stochastic approximation with Markovian noise, and we leverage this framework to provide a unified convergence analysis that applies to all the algorithms.
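To make the setting concrete, the sketch below illustrates one plausible instantiation of federated TD(0) with linear function approximation, matching the structure described above: each of the N agents performs K local TD updates on its own Markovian samples starting from the current global model, and a server averages the local models once per communication round. This is a minimal illustration, not the paper's exact algorithm; the helper names `feature_fn` and `env_step`, the step size, and the discount factor are all assumptions introduced here for exposition.

```python
# Minimal sketch of federated TD(0) with local updates and periodic averaging.
# Assumptions (illustrative, not from the paper): linear value-function
# approximation with feature map `feature_fn`, and an `env_step(agent, state)`
# callable that returns (reward, next_state) sampled from each agent's local
# environment under its behavior policy.

import numpy as np

def federated_td0(feature_fn, env_step, init_states, d, N, rounds, K,
                  alpha=0.05, gamma=0.95):
    """Collaboratively estimate a value-function weight vector theta in R^d."""
    theta_global = np.zeros(d)
    states = list(init_states)           # current state of each agent's Markov chain
    for _ in range(rounds):
        local_thetas = []
        for i in range(N):
            theta = theta_global.copy()   # agent i starts the round from the global model
            s = states[i]
            for _ in range(K):            # K local TD(0) updates on Markovian samples
                r, s_next = env_step(i, s)
                phi, phi_next = feature_fn(s), feature_fn(s_next)
                td_error = r + gamma * phi_next @ theta - phi @ theta
                theta += alpha * td_error * phi
                s = s_next
            states[i] = s                 # the local chain is not reset between rounds
            local_thetas.append(theta)
        # server step: average the N local models once per round
        theta_global = np.mean(local_thetas, axis=0)
    return theta_global
```

The key design choice reflected here is that agents communicate only once every K local steps rather than after every sample, which is what makes the communication cost manageable; the analysis in the paper addresses whether averaging these locally updated models still yields an N-fold speedup despite the Markovian correlation within each agent's samples.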