Distributed Value Function Approximation for Collaborative Multi-Agent Reinforcement Learning

In this paper we propose several novel distributed gradient-based temporal-difference algorithms for multi-agent off-policy learning of a linear approximation of the value function in Markov decision processes with strict information structure constraints, which limit inter-agent communications to small neighborhoods. The algorithms are composed of: 1) local parameter updates based on single-agent off-policy gradient temporal-difference learning algorithms, including eligibility traces with state-dependent parameters, and 2) linear stochastic time-varying consensus schemes, represented by directed graphs. The proposed algorithms differ in their form, definition of eligibility traces, selection of time scales and the way of incorporating consensus iterations. The main contribution of the paper is a convergence analysis based on the general properties of the underlying Feller-Markov processes and the stochastic time-varying consensus model. We prove, under general assumptions, that the parameter estimates generated by all the proposed algorithms converge weakly to the solutions of the corresponding ordinary differential equations (ODEs) with precisely defined invariant sets. It is demonstrated how the adopted methodology can be applied to temporal-difference algorithms under weaker information structure constraints. The variance reduction effect of the proposed algorithms is demonstrated by formulating and analyzing an asymptotic stochastic differential equation. Specific guidelines for communication network design are provided. The algorithms' superior properties are illustrated by characteristic simulation results.
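The two-part structure described above (a local off-policy gradient temporal-difference update followed by a consensus iteration over a directed graph) can be illustrated with a minimal Python sketch. It is not one of the paper's algorithms. Everything here is an assumption made for illustration: the toy three-state MDP, the feature matrix `Phi`, the fixed directed-ring weights, the use of a TDC-style update without eligibility traces, and the choice to average only the primary parameters. The function names (`ring_consensus_matrix`, `consensus_step`, `local_tdc_update`, `run`) are hypothetical.

```python
import numpy as np

def ring_consensus_matrix(n_agents, self_weight=0.5):
    """Row-stochastic weights for a directed ring: agent i listens to agent i-1."""
    A = np.zeros((n_agents, n_agents))
    for i in range(n_agents):
        A[i, i] = self_weight
        A[i, (i - 1) % n_agents] = 1.0 - self_weight
    return A

def consensus_step(A, Theta):
    """Each row of Theta becomes a convex combination of its in-neighbors' rows."""
    return A @ Theta

def local_tdc_update(theta, w, phi, phi_next, reward, rho, gamma, alpha, beta):
    """One off-policy TDC-style step (no eligibility traces), weighted by rho."""
    delta = reward + gamma * (theta @ phi_next) - theta @ phi
    theta_new = theta + alpha * rho * (delta * phi - gamma * phi_next * (phi @ w))
    w_new = w + beta * rho * (delta - phi @ w) * phi
    return theta_new, w_new

def run(n_agents=4, n_steps=2000, gamma=0.9, alpha=0.01, beta=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n_states, n_actions = 3, 2
    # Hypothetical toy MDP: P[a, s] is the next-state distribution, R[s, a] the reward.
    P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
    R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))
    Phi = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])  # assumed linear features
    target = np.array([0.5, 0.5])                          # common target policy
    p0 = np.linspace(0.3, 0.7, n_agents)                   # each agent's own behavior policy
    behaviors = np.stack([p0, 1.0 - p0], axis=1)

    A = ring_consensus_matrix(n_agents)
    Theta = np.zeros((n_agents, Phi.shape[1]))  # primary parameters, one row per agent
    W = np.zeros_like(Theta)                    # auxiliary parameters, kept local here
    s = rng.integers(n_states, size=n_agents)   # each agent follows its own trajectory

    for _ in range(n_steps):
        for i in range(n_agents):
            a = rng.choice(n_actions, p=behaviors[i])
            s_next = rng.choice(n_states, p=P[a, s[i]])
            rho = target[a] / behaviors[i, a]   # importance sampling ratio
            Theta[i], W[i] = local_tdc_update(
                Theta[i], W[i], Phi[s[i]], Phi[s_next], R[s[i], a],
                rho, gamma, alpha, beta)
            s[i] = s_next
        Theta = consensus_step(A, Theta)        # communicate with in-neighbors only
    return Theta

if __name__ == "__main__":
    Theta = run()
    print("per-agent estimates:\n", Theta)
    print("max disagreement:", np.ptp(Theta, axis=0).max())
```

Running the script prints each agent's parameter vector and the largest coordinate-wise spread across agents. Because every row of the weight matrix is a convex combination, a consensus step can never widen that spread, which is the mechanism by which neighbors' noisy local estimates are pulled together.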
