In multi-agent reinforcement learning (MARL), it is challenging for a collection of agents to learn complex, temporally extended tasks. The difficulties lie in computational complexity and in learning the high-level ideas behind reward functions. We study the graph-based Markov decision process (MDP), in which the dynamics of neighboring agents are coupled. We use a reward machine (RM) to encode each agent's task and to expose the internal structure of the reward function; an RM can describe high-level knowledge and encode non-Markovian reward functions. To tackle the computational complexity, we propose a decentralized learning algorithm, decentralized graph-based reinforcement learning using reward machines (DGRM), which equips each agent with a localized policy so that agents make decisions independently based on locally available information. DGRM uses an actor-critic structure, and we introduce a tabular Q-function for discrete-state problems. We show that the dependency of the Q-function on other agents decays exponentially as the distance between them increases. Furthermore, the complexity of DGRM is related to the local information size of the largest $\kappa$-hop neighborhood, and DGRM can find an $O(\rho^{\kappa+1})$-approximation of a stationary point of the objective function. To further improve efficiency, we also propose the deep DGRM algorithm, which uses deep neural networks to approximate the Q-function and the policy for large-scale or continuous-state problems. The effectiveness of the proposed algorithms is evaluated on two case studies: UAV package delivery and COVID-19 pandemic mitigation. Experimental results show that local information is sufficient for DGRM and that agents can accomplish complex tasks with the help of RMs. In the COVID-19 pandemic mitigation case, DGRM improves the global accumulated reward by 119% compared with the baseline.
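To make the truncation idea concrete, the following is a minimal Python sketch, not the paper's implementation, of a tabular Q-function that conditions only on the environment and reward-machine states of an agent's $\kappa$-hop neighborhood; all names here (e.g., `LocalCritic`, `kappa_hop_neighbors`) are hypothetical, and the update shown is a standard one-step TD(0) rule used purely for illustration.

```python
from collections import defaultdict

def kappa_hop_neighbors(adj, i, kappa):
    """Return the set of agents within kappa hops of agent i (including i) via BFS."""
    frontier, seen = {i}, {i}
    for _ in range(kappa):
        frontier = {j for v in frontier for j in adj[v]} - seen
        seen |= frontier
    return seen

class LocalCritic:
    """Tabular Q-function truncated to agent i's kappa-hop neighborhood (illustrative sketch)."""

    def __init__(self, i, adj, kappa, lr=0.1, gamma=0.95):
        self.neighborhood = sorted(kappa_hop_neighbors(adj, i, kappa))
        self.q = defaultdict(float)  # key: (local joint state, local joint action)
        self.lr, self.gamma = lr, gamma

    def local_key(self, joint_state, joint_rm_state, joint_action):
        # Project the global environment states, RM states, and actions
        # onto the kappa-hop neighborhood of agent i.
        s = tuple((joint_state[j], joint_rm_state[j]) for j in self.neighborhood)
        a = tuple(joint_action[j] for j in self.neighborhood)
        return (s, a)

    def td_update(self, key, reward, next_key):
        # One-step TD(0) update on the truncated Q-table.
        target = reward + self.gamma * self.q[next_key]
        self.q[key] += self.lr * (target - self.q[key])
```

The point of the sketch is the projection in `local_key`: the table size grows with the largest $\kappa$-hop neighborhood rather than with the number of agents, which is the source of the complexity and $O(\rho^{\kappa+1})$-approximation claims stated above.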