Game theory provides essential analysis in many applications of strategic interactions. However, the question of how to construct a game model and what is its fidelity is seldom addressed. In this work, we consider learning in a class of repeated zero-sum games with unknown, time-varying payoff matrix, and noisy feedbacks, by making use of an ensemble of benchmark game models. These models can be pre-trained and collected dynamically during sequential plays. They serve as prior side information and imperfectly underpin the unknown true game model. We propose OFULinMat, an episodic learning algorithm that integrates the adaptive estimation of game models and the learning of the strategies. The proposed algorithm is shown to achieve a sublinear bound on the saddle-point regret. We show that this algorithm is provably efficient through both theoretical analysis and numerical examples. We use a dynamic honeypot allocation game as a case study to illustrate and corroborate our results. We also discuss the relationship and highlight the difference between our framework and the classical adversarial multi-armed bandit framework.
翻译:游戏理论在许多战略互动应用中提供了基本分析。 但是, 如何构建游戏模型及其忠实性的问题很少被讨论。 在这项工作中, 我们考虑在一系列重复的零和游戏中学习, 使用一系列基准游戏模型, 使用一系列基准游戏模型。 这些模型可以在连续游戏中预先训练并动态收集。 这些模型可以作为前侧信息, 不完善地支持未知的真实游戏模型。 我们提出OUFLINMat, 这是一种将游戏模型的适应性估计和战略的学习结合起来的直观学习算法。 提议的算法显示可以实现马鞍式遗憾的亚线性约束。 我们通过理论分析和数字示例显示, 这种算法非常有效。 我们用一个动态的蜂蜜分配游戏作为案例研究, 来说明和证实我们的成果。 我们还讨论我们的框架与经典对立式多臂强力框架之间的关系并突出差异。