Pommerman is a hybrid cooperative/adversarial multi-agent environment with challenging characteristics: partial observability, limited or no communication, sparse and delayed rewards, and restrictive computational time limits. This makes it a demanding benchmark for reinforcement learning (RL) approaches. In this paper, we focus on developing a curriculum for learning a robust and promising policy within a constrained computational budget of 100,000 games, starting from a fixed base policy (which is itself trained to imitate a noisy expert policy). All RL runs start from this base policy and use vanilla proximal policy optimization (PPO) with the same reward function; the only difference between them is the mix and sequence of opponent policies used during training. One would expect that beginning training against simpler opponents and gradually increasing opponent difficulty would facilitate faster learning, yielding more robust policies than a baseline in which all available opponent policies are introduced from the start. We test this hypothesis and show that, within a constrained computational budget, it is in fact better to "learn in the school of hard knocks", i.e., to train against all available opponent policies nearly from the start. We also include ablation studies on the effect of modifying the base environment properties of ammo and bomb blast strength on agent performance.
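To make the comparison concrete, the sketch below illustrates how a staged opponent curriculum differs from the "school of hard knocks" baseline, where each of the 100,000 training games samples its opponent from a schedule. The opponent names, pool splits, and stage boundaries here are hypothetical placeholders, not the paper's actual opponent set; only the overall budget and the two scheduling styles come from the abstract.

```python
# Illustrative sketch only: opponent names, pool sizes, and stage boundaries are
# hypothetical; they merely contrast a staged curriculum with a schedule in
# which every opponent policy is available nearly from the start.
import random

TOTAL_GAMES = 100_000  # constrained training budget from the abstract

# Hypothetical opponent pools, ordered from simpler to harder.
SIMPLE_OPPONENTS = ["static", "random"]
MEDIUM_OPPONENTS = ["rule_based"]
HARD_OPPONENTS = ["imitation_expert"]
ALL_OPPONENTS = SIMPLE_OPPONENTS + MEDIUM_OPPONENTS + HARD_OPPONENTS


def curriculum_opponent(game_idx: int) -> str:
    """Gradually widen the opponent pool as training progresses."""
    if game_idx < TOTAL_GAMES // 3:
        pool = SIMPLE_OPPONENTS
    elif game_idx < 2 * TOTAL_GAMES // 3:
        pool = SIMPLE_OPPONENTS + MEDIUM_OPPONENTS
    else:
        pool = ALL_OPPONENTS
    return random.choice(pool)


def hard_knocks_opponent(game_idx: int) -> str:
    """Baseline found to work better: all opponents nearly from the start."""
    return random.choice(ALL_OPPONENTS)


if __name__ == "__main__":
    # Sample opponents at the beginning, middle, and end of training
    # under each schedule to show how the pools differ.
    for schedule in (curriculum_opponent, hard_knocks_opponent):
        sampled = [schedule(i) for i in (0, TOTAL_GAMES // 2, TOTAL_GAMES - 1)]
        print(schedule.__name__, sampled)
```

In both cases the learner itself would be trained identically (PPO from the fixed base policy, same reward function); only the opponent-sampling schedule changes.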