Reinforcement learning (RL) is a general framework that allows systems to learn autonomously through trial-and-error interaction with their environment. In recent years, combining RL with expressive, high-capacity neural network models has led to impressive performance in a diverse range of domains. However, dealing with the large state and action spaces often required for real-world problems remains a significant challenge. In this paper we introduce a new simulation environment, "Gambit", designed as a tool for building scenarios that can drive RL research in a direction useful for military analysis. Using this environment, we focus on an abstracted and simplified room clearance scenario, in which a team of blue agents must make its way through a building and ensure that all rooms are cleared of (and remain clear of) enemy red agents. We implement a multi-agent version of feudal hierarchical RL that introduces a command hierarchy: a commander at the higher level sends orders to multiple agents at the lower level, who simply have to learn to follow these orders. We find that breaking the task down in this way allows us to solve a number of non-trivial floorplans requiring the coordination of multiple agents, and to do so much more efficiently than the standard baseline RL algorithms we compare against. We then explore how qualitatively different behaviour can emerge depending on what we prioritise in the agents' reward function (e.g. clearing the building quickly vs. prioritising the rescue of civilians).
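To make the command-hierarchy idea concrete, the sketch below illustrates (under our own assumptions, not the paper's implementation or the Gambit API) the two-level structure described above: a high-level commander assigns a goal room to each low-level blue agent, and each agent is rewarded only for following the order it was given. Names such as `Commander`, `Worker`, and `goal_reward` are illustrative.

```python
# Minimal illustrative sketch of a feudal command hierarchy for room clearance.
# The learned policies are stubbed out with random choices; only the structure
# (commander issues goals, workers are rewarded for following them) is shown.
import random

ROOMS = ["hall", "office", "storeroom", "stairwell"]

class Commander:
    """High-level policy: assigns a goal room to every blue agent."""
    def act(self, global_obs, n_agents):
        # Placeholder for a learned policy over goals; here goals are random.
        return [random.choice(ROOMS) for _ in range(n_agents)]

class Worker:
    """Low-level policy: only has to learn to follow its current order."""
    def act(self, local_obs, goal):
        # Placeholder for a learned, goal-conditioned movement policy.
        return random.choice(["north", "south", "east", "west", "hold"])

def goal_reward(agent_room, ordered_room):
    """Intrinsic reward for the worker: +1 for being in the ordered room."""
    return 1.0 if agent_room == ordered_room else 0.0

# Toy interaction loop showing the two levels of the hierarchy.
commander, workers = Commander(), [Worker() for _ in range(3)]
goals = commander.act(global_obs=None, n_agents=len(workers))
for step in range(5):
    for i, w in enumerate(workers):
        action = w.act(local_obs=None, goal=goals[i])
        agent_room = random.choice(ROOMS)      # stand-in for the environment transition
        r = goal_reward(agent_room, goals[i])  # worker is rewarded for obeying its order
```

In this decomposition the commander is credited by the task-level reward (e.g. rooms cleared, civilians rescued), while the workers face the much simpler problem of reaching whatever room they are ordered to.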