We consider a dynamic Colonel Blotto game (CBG) in which one of the players is the learner and has limited troops (budget) to allocate over a finite time horizon. At each stage, the learner strategically determines the budget and its distribution among the battlefields based on past observations. The other player is the adversary, who draws its budget allocation strategies from a fixed but unknown distribution. The learner's objective is to minimize the regret, defined as the difference between the payoff of the best dynamic policy and the payoff realized by following a learning algorithm. The dynamic CBG is analyzed under the frameworks of combinatorial bandits and bandits with knapsacks. We first convert the budget-constrained dynamic CBG into a path-planning problem on a graph. We then devise an efficient dynamic policy for the learner that uses the combinatorial bandit algorithm Edge on the path-planning graph as a subroutine within the LagrangeBwK algorithm. A high-probability regret bound is derived, and it is shown that under the proposed policy, the learner's regret in the budget-constrained dynamic CBG matches (up to a logarithmic factor) that of the repeated CBG without budget constraints.
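To make the path-planning reduction concrete, the sketch below (our illustration under assumed conventions, not the paper's implementation) shows one standard way to encode budget-feasible allocations as paths in a layered DAG: nodes track the pair (battlefield index, budget spent so far), each edge carries the amount assigned to the next battlefield, and every source-to-sink path corresponds to one feasible allocation. Under such an encoding a stage payoff decomposes across edges, which is what allows an edge-level combinatorial bandit algorithm to be applied.

```python
# Hypothetical sketch of the layered-graph encoding of budget-feasible
# allocations; the paper's actual construction may differ in details
# (e.g., sink normalization or exact budget use per stage).

def build_layered_graph(k: int, B: int):
    """Edge set of the layered DAG for k battlefields and integer budget B.

    Each edge ((i, spent), (i + 1, spent + a), a) means: at battlefield i,
    having consumed `spent` troops so far, assign `a` troops to battlefield i.
    (All (layer, spent) pairs are generated for simplicity, including a few
    unreachable ones at early layers.)
    """
    edges = []
    for i in range(k):                      # layer i -> layer i + 1
        for spent in range(B + 1):          # budget consumed so far
            for a in range(B - spent + 1):  # troops sent to battlefield i
                edges.append(((i, spent), (i + 1, spent + a), a))
    return edges

def paths_to_allocations(k: int, B: int):
    """Enumerate source-to-sink paths; each path is one feasible allocation."""
    allocations = []
    def walk(i, spent, alloc):
        if i == k:
            allocations.append(tuple(alloc))  # one complete path = one allocation
            return
        for a in range(B - spent + 1):
            walk(i + 1, spent + a, alloc + [a])
    walk(0, 0, [])
    return allocations

if __name__ == "__main__":
    # With k = 2 battlefields and budget B = 2 there are 6 feasible
    # allocations (paths): (0,0), (0,1), (0,2), (1,0), (1,1), (2,0).
    print(paths_to_allocations(2, 2))
```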