Learning from human demonstrations via direct behavior cloning can yield high-performance policies, provided the algorithm has access to large amounts of high-quality data covering the scenarios the agent is most likely to encounter during operation. In real-world settings, however, expert data is limited, and the goal is to train an agent whose behavior policy is general enough to handle situations the human expert did not demonstrate. An alternative is to learn these policies without supervision via deep reinforcement learning, but these algorithms require a large amount of computing time to perform well on complex tasks with high-dimensional state and action spaces, such as those found in StarCraft II. Automatic curriculum learning is a recent family of techniques designed to speed up deep reinforcement learning by adjusting the difficulty of the current task according to the agent's current capabilities. Designing a proper curriculum can itself be challenging for sufficiently complex tasks, so we leverage human demonstrations to guide agent exploration during training. In this work, we aim to train deep reinforcement learning agents that can command multiple heterogeneous actors, where the starting positions and overall difficulty of the task are controlled by a curriculum automatically generated from a single human demonstration. Our results show that an agent trained via automated curriculum learning can outperform state-of-the-art deep reinforcement learning baselines and match the performance of the human expert in a simulated command and control task in StarCraft II modeled on a real military scenario.
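To make the idea of a demonstration-driven curriculum concrete, the following is a minimal sketch of one common way such a curriculum can be built: a reverse curriculum that starts episodes near the end of the demonstration and moves the start point earlier as the agent succeeds. This is an illustration under stated assumptions, not necessarily the procedure used in this work. The class name `DemoCurriculum`, its parameters (`window`, `threshold`, `step`), and the assumption that the demonstration ends in a successful state are all hypothetical.

```python
from collections import deque


class DemoCurriculum:
    """Illustrative reverse curriculum over a single demonstration.

    Assumption: `demo_states` is an ordered list of environment states
    recorded from one successful human demonstration, so later states
    are closer to task completion and therefore easier starting points.
    """

    def __init__(self, demo_states, window=20, threshold=0.8, step=1):
        if not demo_states:
            raise ValueError("demonstration must contain at least one state")
        self.demo_states = list(demo_states)
        self.window = window        # number of recent episodes to average over
        self.threshold = threshold  # success rate required to raise difficulty
        self.step = step            # how many demo states to move back at once
        # Begin at the final demonstrated state (the easiest start).
        self.start_index = len(self.demo_states) - 1
        self.results = deque(maxlen=window)

    def start_state(self):
        """Return the state from which the next training episode begins."""
        return self.demo_states[self.start_index]

    def report(self, success):
        """Record an episode outcome and raise difficulty if warranted."""
        self.results.append(bool(success))
        window_full = len(self.results) == self.window
        if window_full and sum(self.results) / self.window >= self.threshold:
            # Move the start point earlier in the demonstration (harder task)
            # and reset the statistics for the new difficulty level.
            self.start_index = max(0, self.start_index - self.step)
            self.results.clear()

    @property
    def finished(self):
        """True once episodes start from the demonstration's first state."""
        return self.start_index == 0
```

In a training loop, each episode would be reset to `start_state()`, and the episode outcome passed to `report()`. Difficulty then tracks the agent's measured success rate rather than a hand-designed schedule.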