Efficient exploration is critical in cooperative deep Multi-Agent Reinforcement Learning (MARL). In this paper, we propose an exploration method that efficiently encourages cooperative exploration based on the idea of the theoretically justified tree search algorithm UCT (Upper Confidence bounds applied to Trees). The high-level intuition is that, under optimism-based exploration, agents can achieve cooperative strategies if each agent's optimism estimate captures a structured dependency on the other agents. At each node (i.e., action) of the search tree, UCT performs optimism-based exploration using a bonus derived by conditioning on the visitation count of its parent node. We provide a perspective that views MARL as tree search iterations and develop a method called Conditionally Optimistic Exploration (COE). We assume agents take actions in a fixed sequential order, and treat nodes at the same depth of the search tree as the actions of one individual agent. COE computes each agent's state-action value estimate with an optimistic bonus derived from the visitation count of the state and the joint actions taken by the agents up to and including the current agent. COE is adaptable to any value decomposition method for centralized training with decentralized execution. Experiments across various cooperative MARL benchmarks show that COE outperforms current state-of-the-art exploration methods on hard-exploration tasks.
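To make the conditioning structure concrete, the following is a minimal sketch of how a UCT-style count bonus could be conditioned on the actions of preceding agents. It is illustrative only: the exploration constant c, the count function N, and the agent ordering 1..n are assumed notation, and the exact bonus form used by COE is the one defined in the paper.

% Illustrative sketch, not the paper's exact formulation.
% c, N(.), and the ordering of agents 1..n are assumed notation.
\begin{align*}
  % UCT: the bonus at a node is conditioned on the parent node's
  % visitation count N(s).
  a^{*} &= \arg\max_{a}\; Q(s,a) + c\sqrt{\frac{\ln N(s)}{N(s,a)}} \\
  % COE-style analogue: agent i's optimistic value is conditioned on
  % the joint action a_{1:i-1} already chosen by the preceding agents.
  \tilde{Q}_{i}(s,a_{i}) &= Q_{i}(s,a_{i}) + c\sqrt{\frac{\ln N(s,a_{1:i-1})}{N(s,a_{1:i})}}
\end{align*}

Read this way, each agent plays the role of one depth level of the search tree, and its bonus shrinks as the state together with the preceding agents' joint action is visited more often.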