We propose a fully distributed actor-critic architecture, named Diff-DAC, with application to multitask reinforcement learning (MRL). During the learning process, agents communicate their value and policy parameters to their neighbours, diffusing the information across a network of agents with no need for a central station. Each agent can only access data from its local task, but aims to learn a common policy that performs well for the whole set of tasks. The architecture is scalable, since the computational and communication cost per agent depends on the number of neighbours rather than the overall number of agents. We derive Diff-DAC from duality theory and provide novel insights into the actor-critic framework, showing that it is actually an instance of the dual ascent method. We prove almost sure convergence of Diff-DAC to a common policy under general assumptions that hold even for deep neural network approximations. Under more restrictive assumptions, we also prove that this common policy is a stationary point of an approximation of the original problem. Numerical results on multitask extensions of common continuous control benchmarks demonstrate that Diff-DAC stabilises learning and has a regularising effect that induces higher performance and better generalisation properties than previous architectures.
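To illustrate the diffusion step described above, one standard way such an update can be sketched is an adapt-then-combine rule, in which each agent first takes a local gradient step on its own task and then averages parameters with its neighbours. The notation below ($\theta_k$, $\mathcal{N}_k$, $c_{lk}$, $\alpha$, $\widehat{J}_k$) is illustrative only and is not taken from the paper's definitions:

\[
\tilde{\theta}_k \leftarrow \theta_k + \alpha \,\nabla_{\theta}\widehat{J}_k(\theta_k),
\qquad
\theta_k \leftarrow \sum_{l \in \mathcal{N}_k} c_{lk}\, \tilde{\theta}_l,
\qquad
\sum_{l \in \mathcal{N}_k} c_{lk} = 1,\; c_{lk} \ge 0,
\]

where $\mathcal{N}_k$ denotes the neighbourhood of agent $k$ (including $k$ itself), $\widehat{J}_k$ is agent $k$'s local objective estimated from its own task data, and the combination weights $c_{lk}$ are nonzero only between neighbours, which is why each agent's computational and communication cost scales with its number of neighbours rather than with the size of the network.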