The goal of an agent in a multi-agent environment is to maximize its total reward against the opposing agents it encounters. Following a game-theoretic solution concept, such as Nash equilibrium, can achieve strong performance in some settings; however, such approaches fail to capitalize on historical and observed data from repeated interactions with opponents. Opponent-modeling algorithms apply machine learning techniques to exploit suboptimal opponents from available data; however, the effectiveness of such approaches in imperfect-information games has so far been quite limited. We show that existing opponent-modeling approaches fail to satisfy a simple desirable property even against static opponents drawn from a known prior distribution: they do not guarantee that the model approaches the opponent's true strategy even in the limit as the number of game iterations approaches infinity. We develop a new algorithm that achieves this property and runs efficiently by solving a convex minimization problem over the sequence-form game representation using projected gradient descent. Under standard Bayesian identifiability and visitation assumptions, the algorithm is guaranteed to converge efficiently to the opponent's true strategy, given observations from gameplay and, if available, additional historical data.
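To make the optimization step concrete, the following is a minimal, illustrative sketch rather than the paper's full algorithm: it fits a maximum-likelihood opponent model by projected gradient descent on a convex negative log-likelihood, in the simplified case of a single opponent information set, where the sequence-form strategy polytope reduces to the probability simplex. The helper names project_simplex and fit_opponent_model, the learning rate, and the example counts are all hypothetical; the full method would instead project onto the sequence-form (treeplex) constraint set of the entire game.

import numpy as np

def project_simplex(v):
    # Euclidean projection of v onto the probability simplex.
    # With a single information set, the opponent's sequence-form
    # strategy space is exactly the simplex; the full algorithm would
    # project onto the sequence-form polytope of the whole game tree.
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0.0)

def fit_opponent_model(action_counts, lr=0.1, iters=500):
    # Maximum-likelihood opponent model via projected gradient descent.
    # action_counts[a] = number of times action a was observed at this
    # information set. The negative log-likelihood -sum_a c_a log x_a is
    # convex in x, so projected gradient descent recovers the opponent's
    # empirical strategy (and, with enough data, the true strategy).
    counts = np.asarray(action_counts, dtype=float)
    n = len(counts)
    x = np.full(n, 1.0 / n)                       # uniform initial model
    for _ in range(iters):
        grad = -counts / np.clip(x, 1e-12, None)  # gradient of the NLL
        x = project_simplex(x - lr * grad / counts.sum())
    return x

# Hypothetical data: action 0 observed 60 times, action 1 30 times, action 2 10 times.
print(fit_opponent_model([60, 30, 10]))           # approaches [0.6, 0.3, 0.1]

The sketch reflects one natural instantiation of the idea in the abstract: because reach probabilities are linear in the sequence-form (realization-plan) variables, the log-likelihood of observed play is concave in those variables, so maximum-likelihood model fitting becomes a convex minimization that projected gradient descent can solve efficiently.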