We propose a fully distributed actor-critic algorithm, approximated by deep neural networks and named \textit{Diff-DAC}, with application to single-task and to average multitask reinforcement learning (MRL). Each agent has access to data from its local task only, but aims to learn a policy that performs well on average across the whole set of tasks. During the learning process, agents communicate their value-policy parameters to their neighbors, diffusing the information across the network so that they converge to a common policy, with no need for a central node. The method is scalable, since the computational and communication costs per agent grow only with the number of its neighbors. We derive Diff-DAC from duality theory and provide novel insights into the standard actor-critic framework, showing that it is actually an instance of the dual ascent method that approximates the solution of a linear program. Experiments suggest that Diff-DAC can outperform the only previous distributed MRL approach, Dist-MTLPS, and even the centralized architecture.
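As an illustrative sketch only (the neighborhood $\mathcal{N}_k$, combination weights $a_{lk}$, step size $\alpha$, and local objective estimate $\widehat{J}_k$ below are notational assumptions, not the paper's exact recursion), a diffusion-style adapt-then-combine update for the parameters $\theta_k$ of agent $k$ could take the form
\begin{equation*}
  \phi_k \leftarrow \theta_k + \alpha\,\widehat{\nabla}_{\theta}\widehat{J}_k(\theta_k),
  \qquad
  \theta_k \leftarrow \sum_{l \in \mathcal{N}_k} a_{lk}\,\phi_l,
  \qquad
  \sum_{l \in \mathcal{N}_k} a_{lk} = 1,\; a_{lk} \ge 0,
\end{equation*}
i.e., each agent first adapts its parameters with a gradient step computed from its local task and then combines the intermediate iterates of its neighbors, which is the mechanism that lets all agents converge to a common policy without a central node.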