In this work, we study a system of two interacting, non-cooperative Q-learning agents in which one agent has the privilege of observing the other's actions. We show that this information asymmetry can lead to a stable outcome of population learning, which generally does not occur in an environment of independent learners. The resulting post-learning policies are almost optimal in the sense of the underlying game, i.e., they form a Nash equilibrium. Furthermore, we propose a Q-learning algorithm that requires predictive observation of the opponent's next two actions and yields an optimal strategy provided that the opponent applies a stationary strategy. We also discuss the existence of a Nash equilibrium in the underlying information-asymmetric game.
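The asymmetric setup described above can be illustrated with a minimal sketch. This is not the algorithm proposed in this work (in particular, it does not use predictive observation of two subsequent actions). It assumes a stateless repeated prisoner's dilemma with hypothetical payoffs, an "informed" agent that sees the other agent's current action before choosing its own, and illustrative values for the learning rate, exploration rate, and seed.

```python
import random

# Assumption: hypothetical prisoner's dilemma payoffs; 0 = cooperate, 1 = defect.
# PAYOFF[a_blind][a_informed] = (reward to blind agent, reward to informed agent)
PAYOFF = [
    [(3, 3), (0, 5)],
    [(5, 0), (1, 1)],
]


def greedy(values):
    """Index of the largest value (ties go to the lower index)."""
    return max(range(len(values)), key=lambda a: values[a])


def eps_greedy(values, eps, rng):
    """Pick a uniformly random action with probability eps, else the greedy one."""
    if rng.random() < eps:
        return rng.randrange(len(values))
    return greedy(values)


def train(steps=20000, alpha=0.1, eps=0.1, seed=0):
    """Stateless Q-learning for two agents with asymmetric information.

    The blind agent keeps one Q-value per own action. The informed agent
    keeps one Q-value per (observed opponent action, own action) pair.
    All hyperparameters are illustrative assumptions.
    """
    rng = random.Random(seed)
    q_blind = [0.0, 0.0]
    q_informed = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(steps):
        a_blind = eps_greedy(q_blind, eps, rng)
        # The informed agent conditions its choice on the observed action.
        a_inf = eps_greedy(q_informed[a_blind], eps, rng)
        r_blind, r_inf = PAYOFF[a_blind][a_inf]
        # Single-state game, so the update has no discounted next-state term.
        q_blind[a_blind] += alpha * (r_blind - q_blind[a_blind])
        q_informed[a_blind][a_inf] += alpha * (r_inf - q_informed[a_blind][a_inf])
    return q_blind, q_informed


q_blind, q_informed = train()
blind_action = greedy(q_blind)
informed_policy = [greedy(row) for row in q_informed]
print("blind agent's greedy action:", blind_action)
print("informed agent's greedy response to each observed action:", informed_policy)
```

In this toy game defection is a dominant strategy, so both greedy policies would be expected to settle on mutual defection, which is the game's Nash equilibrium. The sketch only shows how the informed agent's Q-table is indexed by the observed action; it does not reproduce the stability results claimed above.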