Finding a best response policy is a central objective in game theory and multi-agent learning, with modern population-based training approaches employing reinforcement learning algorithms as best-response oracles to improve play against candidate opponents (typically previously learnt policies). We propose Best Response Expert Iteration (BRExIt), which accelerates learning in games by incorporating opponent models into the state-of-the-art learning algorithm Expert Iteration (ExIt). BRExIt aims to (1) improve feature shaping in the apprentice, with a policy head predicting opponent policies as an auxiliary task, and (2) bias opponent moves in planning towards the given or learnt opponent model, to generate apprentice targets that better approximate a best response. In an empirical ablation of BRExIt's algorithmic variants in the game Connect4, played against a set of fixed test agents, we provide statistical evidence that BRExIt learns well-performing policies with greater sample efficiency than ExIt.
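
As a rough, non-authoritative sketch of the two ideas above (not taken from the paper), the Python snippet below assumes a PyTorch-style apprentice network with an auxiliary head predicting the opponent's policy, plus a hypothetical helper that selects priors at opponent-to-move nodes from the given or learnt opponent model during planning; all class, function, and parameter names are illustrative assumptions.

import torch
import torch.nn as nn

class Apprentice(nn.Module):
    # Hypothetical apprentice network: a shared torso feeding a policy head,
    # a value head, and an auxiliary head that predicts the opponent's policy.
    def __init__(self, obs_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        self.torso = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.policy_head = nn.Linear(hidden, num_actions)    # apprentice's own policy
        self.value_head = nn.Linear(hidden, 1)                # state-value estimate
        self.opponent_head = nn.Linear(hidden, num_actions)  # auxiliary task: opponent policy

    def forward(self, obs: torch.Tensor):
        h = self.torso(obs)
        return (
            torch.softmax(self.policy_head(h), dim=-1),
            torch.tanh(self.value_head(h)),
            torch.softmax(self.opponent_head(h), dim=-1),
        )

def node_priors(to_move: int, own_policy: torch.Tensor, opponent_policy: torch.Tensor) -> torch.Tensor:
    # During tree expansion, use the opponent model's distribution as the prior
    # at opponent-to-move nodes, biasing planning towards its predicted moves;
    # at the learning agent's own nodes, fall back to the apprentice policy.
    return opponent_policy if to_move != 0 else own_policy

In such a setup, the auxiliary opponent-policy head would typically be trained with an extra cross-entropy term against observed or modelled opponent actions, while planning targets come from search trees whose opponent nodes are expanded with the biased priors above.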