We investigate model-free multi-agent reinforcement learning (MARL) in environments where off-beat actions are prevalent, i.e., all actions have pre-set execution durations. During these durations, environment changes are influenced by, but not synchronized with, action execution. Such settings are ubiquitous in many real-world problems. However, most MARL methods assume that actions are executed immediately after inference, which is often unrealistic and can lead to catastrophic failures in multi-agent coordination with off-beat actions. To fill this gap, we develop an algorithmic framework for MARL with off-beat actions. We then propose a novel episodic memory, LeGEM, for model-free MARL algorithms. LeGEM builds agents' episodic memories from their individual experiences. It boosts multi-agent learning by addressing the challenging temporal credit assignment problem raised by off-beat actions through our novel reward redistribution scheme, alleviating the issue of non-Markovian reward. We evaluate LeGEM on various multi-agent scenarios with off-beat actions, including the Stag-Hunter Game, Quarry Game, Afforestation Game, and StarCraft II micromanagement tasks. Empirical results show that LeGEM significantly boosts multi-agent coordination and achieves leading performance with improved sample efficiency.