Traditional multi-agent reinforcement learning (MARL) algorithms, such as independent Q-learning, struggle in partially observable scenarios where agents are required to develop delicate action sequences. This is often because the reward for a good action only becomes available after the other agents have taken theirs, and these actions are not credited accordingly. Recurrent neural networks have proven to be a viable strategy for solving these types of problems, yielding significant performance increases over other methods. In this paper, we explore a different approach and focus on the experiences used to update the action-value functions of each agent. We introduce the concept of credit-cognisant rewards (CCRs), which allow an agent to perceive the effect its actions had on the environment as well as on its co-agents. We show that by manipulating these experiences, and constructing the reward contained within each of them to include the rewards received by all the agents within the same action sequence, we are able to improve significantly on the performance of both independent deep Q-learning and deep recurrent Q-learning. We evaluate and test the performance of CCRs when applied to these deep reinforcement learning techniques on a simplified version of the popular card game Hanabi.
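The abstract describes CCRs as experiences whose stored reward is rewritten to include the rewards received by all agents within the same action sequence. The sketch below illustrates one possible reading of that idea; it is not the paper's implementation. It assumes an "action sequence" closes once every agent has acted (or the episode ends), and that each agent's transition then stores the summed reward of that sequence. Names such as CCRBuffer and add_step are illustrative only.

```python
# Minimal sketch of credit-cognisant reward (CCR) assignment, under the
# assumptions stated above.
from collections import namedtuple

Transition = namedtuple("Transition", ["obs", "action", "reward", "next_obs", "done"])


class CCRBuffer:
    """Collects per-agent transitions for one action sequence, then rewrites
    each stored reward with the summed reward of the whole sequence."""

    def __init__(self, num_agents):
        self.num_agents = num_agents
        self._pending = []   # transitions awaiting CCR assignment
        self.storage = []    # finalised credit-cognisant experiences

    def add_step(self, agent_id, obs, action, reward, next_obs, done):
        self._pending.append((agent_id, Transition(obs, action, reward, next_obs, done)))
        # Once every agent has acted (or the episode ends), close the sequence.
        if len(self._pending) == self.num_agents or done:
            self._finalise_sequence()

    def _finalise_sequence(self):
        # The credit-cognisant reward: the combined reward of the sequence,
        # so each agent perceives the effect of its action on its co-agents.
        ccr = sum(t.reward for _, t in self._pending)
        for agent_id, t in self._pending:
            self.storage.append((agent_id, t._replace(reward=ccr)))
        self._pending.clear()


# Usage: two agents each take one action; both stored experiences receive
# the combined sequence reward (0.0 + 1.0 = 1.0).
buf = CCRBuffer(num_agents=2)
buf.add_step(0, obs="o0", action=3, reward=0.0, next_obs="o0_next", done=False)
buf.add_step(1, obs="o1", action=1, reward=1.0, next_obs="o1_next", done=False)
assert all(t.reward == 1.0 for _, t in buf.storage)
```

The experiences in `storage` would then be consumed by an otherwise standard independent deep Q-learning (or deep recurrent Q-learning) update; only the reward inside each experience differs from the usual per-agent reward.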