Distributed TD(0) with Almost No Communication

We provide a new non-asymptotic analysis of distributed TD(0) with linear function approximation. Our approach relies on "one-shot averaging," where $N$ agents run local copies of TD(0) and average the outcomes only once, at the very end. We consider two models: the global state model, in which the agents interact with an environment they can observe and whose transitions depend on all of their actions, and the local state model, in which each agent can run a local copy of an identical Markov Decision Process. In the global state model, we show that the convergence rate of our distributed one-shot averaging method matches the known convergence rate of TD(0). By contrast, the best convergence rate in the previous literature, in the worst case, underperformed the non-distributed version by $O(N^3)$ in terms of the number of agents $N$. In the local state model, we demonstrate a version of the linear time speedup phenomenon, where the convergence time of the distributed process is a factor of $N$ faster than the convergence time of TD(0). As far as we are aware, this is the first result rigorously showing benefits from parallelism for temporal difference methods.
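To make the one-shot averaging idea concrete, the following is a minimal sketch of the local state model only, not the authors' implementation. It assumes a hypothetical 3-state Markov reward process (the policy is taken as already fixed), hypothetical 2-dimensional features, and a constant step size; the names `local_td0`, `one_shot_average`, and `distributed_td0` are illustrative. Each agent runs the standard TD(0) update with linear function approximation on its own independent copy, and the final parameter vectors are averaged exactly once.

```python
import random

# Hypothetical 3-state Markov reward process (assumption, not from the paper).
P = [[0.5, 0.5, 0.0],
     [0.1, 0.6, 0.3],
     [0.4, 0.0, 0.6]]          # row-stochastic transition matrix
R = [1.0, 0.0, -1.0]           # reward received when leaving each state
PHI = [[1.0, 0.0],
       [0.5, 0.5],
       [0.0, 1.0]]             # hypothetical linear features, one row per state
GAMMA = 0.9                    # discount factor


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def local_td0(num_steps, step_size, rng):
    """Run TD(0) with linear function approximation on one private copy."""
    theta = [0.0, 0.0]
    s = rng.randrange(len(P))
    for _ in range(num_steps):
        # Sample the next state from row s of the transition matrix.
        s_next = rng.choices(range(len(P)), weights=P[s])[0]
        # Temporal difference error under the current linear estimate.
        td_error = R[s] + GAMMA * dot(PHI[s_next], theta) - dot(PHI[s], theta)
        # Standard TD(0) step: move theta along the current feature vector.
        theta = [t + step_size * td_error * f for t, f in zip(theta, PHI[s])]
        s = s_next
    return theta


def one_shot_average(thetas):
    """Average the agents' final parameters; the only communication step."""
    n = len(thetas)
    return [sum(col) / n for col in zip(*thetas)]


def distributed_td0(num_agents, num_steps, step_size=0.05, seed=0):
    # Separate random streams stand in for independent local copies.
    local_thetas = [local_td0(num_steps, step_size, random.Random(seed + i))
                    for i in range(num_agents)]
    return one_shot_average(local_thetas), local_thetas


if __name__ == "__main__":
    avg, _ = distributed_td0(num_agents=8, num_steps=2000)
    print("averaged parameters:", avg)
```

No messages are exchanged during the local runs, so the communication cost is a single averaging round regardless of how many steps each agent takes. The sketch does not cover the global state model, where all agents observe a shared environment.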
