This paper proposes a \emph{fully asynchronous} scheme for the policy evaluation problem of distributed reinforcement learning (DisRL) over directed peer-to-peer networks. Without waiting for any other node of the network, each node can locally update its value function at any time using (possibly delayed) information from its neighbors. This is in sharp contrast to gossip-based schemes, in which a pair of nodes update concurrently. Though the fully asynchronous setting involves a difficult multi-timescale decision problem, we design a novel stochastic average gradient (SAG) based distributed algorithm and develop a push-pull augmented graph approach to prove its exact convergence at a linear rate of $\mathcal{O}(c^k)$, where $c\in(0,1)$ and $k$ increases by one whenever any node performs an update. Finally, numerical experiments show that our method speeds up linearly with respect to the number of nodes and is robust to straggler nodes.
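To make the stated rate concrete, the following is a short worked reading of $\mathcal{O}(c^k)$, not a result taken from the paper. The symbols $e_k$ (some nonnegative error measure after $k$ updates), $C$ (a constant), and $\varepsilon$ (a target accuracy) are assumed here for illustration only; the abstract does not specify which error measure is used.

\[
e_k \le C\,c^{k}, \quad c\in(0,1)
\qquad\Longrightarrow\qquad
e_k \le \varepsilon \ \text{ whenever } \ k \ge \frac{\log(C/\varepsilon)}{\log(1/c)} .
\]

Under this reading, the number of updates needed grows only logarithmically in $1/\varepsilon$. Because $k$ counts updates made by any node, that budget is shared across the whole network rather than required of each node individually.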