We propose a simple, general, and effective technique, Reward Randomization, for discovering diverse strategic policies in complex multi-agent games. Combining reward randomization and policy gradient, we derive a new algorithm, Reward-Randomized Policy Gradient (RPG). RPG is able to discover multiple distinctive, human-interpretable strategies in challenging temporal trust dilemmas, including grid-world games and the real-world game Agar.io, where multiple equilibria exist but standard multi-agent policy gradient algorithms consistently converge to a single equilibrium with a sub-optimal payoff for every player, even when using state-of-the-art exploration techniques. Furthermore, with the set of diverse strategies from RPG, we can (1) achieve higher payoffs by fine-tuning the best policy from the set; and (2) obtain an adaptive agent by using this set of strategies as its training opponents. The source code and example videos can be found on our website: https://sites.google.com/view/staghuntrpg.
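To make the abstract's core loop concrete, here is a minimal toy sketch of the RPG idea on a 2x2 Stag Hunt matrix game: sample perturbed reward functions, train an independent policy pair on each with policy gradient, re-evaluate every candidate under the true reward, and fine-tune the best one. The payoff values, the perturbation range, the number of samples, and the simple REINFORCE-style update are all illustrative assumptions for this sketch; the paper's actual implementation and hyperparameters differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# True (symmetric) Stag Hunt payoff for the row player:
# actions 0 = Stag, 1 = Hare. Values are illustrative assumptions.
R_TRUE = np.array([[4.0, 0.0],
                   [3.0, 1.0]])

def softmax(t):
    z = np.exp(t - t.max())  # max-shift for numerical stability
    return z / z.sum()

def train_policy_gradient(R, theta=None, iters=2000, lr=0.1):
    """REINFORCE-style self-play for both players on payoff matrix R."""
    theta = [np.zeros(2), np.zeros(2)] if theta is None else [t.copy() for t in theta]
    for _ in range(iters):
        pi = [softmax(t) for t in theta]
        a = [rng.choice(2, p=p) for p in pi]
        rewards = [R[a[0], a[1]], R[a[1], a[0]]]  # symmetric game
        for i in range(2):
            grad = -pi[i]
            grad[a[i]] += 1.0                 # grad of log pi_i(a_i)
            theta[i] += lr * rewards[i] * grad
    return theta

def evaluate(theta, R):
    """Expected payoff of player 0 under payoff matrix R."""
    pi = [softmax(t) for t in theta]
    return pi[0] @ R @ pi[1]

# (1) Reward randomization: sample perturbed payoff matrices and train
#     an independent policy pair on each one.
candidates = []
for _ in range(10):
    R_sampled = R_TRUE + rng.uniform(-3.0, 3.0, size=R_TRUE.shape)
    candidates.append(train_policy_gradient(R_sampled))

# (2) Re-evaluate all candidates under the TRUE reward, keep the best,
#     and fine-tune it with further policy-gradient steps on R_TRUE.
best = max(candidates, key=lambda th: evaluate(th, R_TRUE))
best = train_policy_gradient(R_TRUE, theta=best, iters=500)
print("payoff under true reward:", evaluate(best, R_TRUE))
```

In this toy setting, plain self-play policy gradient on `R_TRUE` tends to settle on the risk-dominant Hare equilibrium (payoff 1), whereas some perturbed-reward runs discover the Stag equilibrium (payoff 4), which then survives re-evaluation and fine-tuning under the true reward, mirroring the abstract's claim.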