Measuring and promoting policy diversity is critical for solving games with strong non-transitive dynamics, where strategic cycles exist and there is no consistent winner (e.g., Rock-Paper-Scissors). With that in mind, maintaining a pool of diverse policies via open-ended learning is an attractive solution, which can generate auto-curricula to avoid being exploited. However, conventional open-ended learning algorithms lack a widely accepted definition of diversity, making it hard to construct and evaluate diverse policies. In this work, we summarize previous concepts of diversity and work towards offering a unified measure of diversity in multi-agent open-ended learning that covers all elements in Markov games, based on both Behavioral Diversity (BD) and Response Diversity (RD). At the trajectory-distribution level, we re-define BD in the state-action space as the discrepancy between occupancy measures. For the reward dynamics, we propose RD to characterize diversity through the responses of policies when encountering different opponents. We also show that many existing diversity measures fall into one of the categories of BD or RD, but not both. With this unified diversity measure, we design the corresponding diversity-promoting objective and population effectivity when seeking the best responses in open-ended learning. We validate our methods both in relatively simple games, such as matrix games and the non-transitive mixture model, and in the complex \textit{Google Research Football} environment. The population found by our methods achieves the lowest exploitability and the highest population effectivity in matrix games and the non-transitive mixture model, as well as the largest goal difference when playing against opponents of various levels in \textit{Google Research Football}.
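To make the two notions concrete, here is one possible formalization in illustrative notation (a sketch under the standard Markov-game setup; the exact definitions appear in the body of the paper). Writing $\rho_{\pi}(s,a)$ for the discounted occupancy measure of a policy $\pi$ and $r(\pi)$ for its response vector against a fixed pool of opponents $\{\mu_1,\dots,\mu_K\}$,
\[
\rho_{\pi}(s,a) = (1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\Pr(s_t = s,\, a_t = a \mid \pi), \qquad
\mathrm{BD}(\pi_i,\pi_j) = D\big(\rho_{\pi_i}\,\|\,\rho_{\pi_j}\big),
\]
where $D$ is a discrepancy between distributions (e.g., total variation or an $f$-divergence), and
\[
r(\pi) = \big(\mathbb{E}[R \mid \pi,\mu_1],\,\dots,\,\mathbb{E}[R \mid \pi,\mu_K]\big), \qquad
\mathrm{RD}(\pi_i,\pi_j) = \big\lVert r(\pi_i) - r(\pi_j) \big\rVert,
\]
so BD captures differences in how policies behave (trajectory distributions), while RD captures differences in how they are rewarded against the same opponents.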
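For the evaluation metric mentioned above, a hedged sketch in a symmetric two-player zero-sum setting, where the game value is zero (illustrative notation; population effectivity, its population-level counterpart, is defined precisely in the main text): given a population $\mathcal{P} = \{\pi_1,\dots,\pi_n\}$ aggregated by a meta-strategy $\sigma \in \Delta_n$,
\[
\mathrm{Exp}(\sigma) = \max_{\pi'} \; \mathbb{E}_{\pi \sim \sigma}\big[R(\pi', \pi)\big],
\]
so that $\mathrm{Exp}(\sigma)=0$ exactly when no opponent can profit against the aggregated population, and lower exploitability indicates a population that better covers the strategic cycles of the game.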