We study a general Markov game with metric switching costs: in each round, the player adaptively chooses one of several Markov chains to advance, with the objective of minimizing the expected cost for at least $k$ chains to reach their target states. Whenever the player switches to a different chain, an additional switching cost is incurred. The special case in which there is no switching cost was solved optimally by Dumitriu, Tetali, and Winkler~\cite{DTW03} using a variant of the celebrated Gittins index for the classical multi-armed bandit (MAB) problem with Markovian rewards~\cite{Git74,Git79}. However, for the Markovian multi-armed bandit problem with nontrivial switching costs, even when the switching cost is a constant, the classic paper by Banks and Sundaram~\cite{BS94} showed that no index strategy can be optimal. In this paper, we complement their result and show that a simple index strategy achieves a constant approximation factor when the switching cost is constant and $k=1$. To the best of our knowledge, this is the first strategy that achieves a constant approximation factor for a general Markovian MAB variant with switching costs. For general metric switching costs, we propose a more involved constant-factor approximation algorithm, via a nontrivial reduction to the stochastic $k$-TSP problem, in which a Markov chain is approximated by a random variable. Our analysis makes extensive use of various interesting properties of the Gittins index.
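To make the cost model concrete, the following is a minimal simulation sketch of one episode of the game, not the algorithm of this paper. It rests on several assumptions that the abstract does not fix: every move costs 1, the switching cost is a single constant `switch_cost` (the uniform special case rather than a general metric), the first chain is chosen for free, and each chain starts in state 0. The names `simulate`, `stay`, and `alternate` are hypothetical.

```python
import random

def simulate(chains, targets, policy, k=1, switch_cost=1.0, rng=None):
    """Play one episode and return the total cost paid.

    chains[i][s] lists (next_state, probability) pairs for chain i.
    Assumptions: unit cost per move, constant switching cost,
    no charge for the very first choice, all chains start in state 0.
    The policy must never pick a chain that already hit its target.
    """
    rng = rng or random.Random(0)
    states = [0] * len(chains)
    current = None  # chain advanced in the previous round
    total = 0.0
    while sum(s == t for s, t in zip(states, targets)) < k:
        i = policy(states, current)
        if current is not None and i != current:
            total += switch_cost  # pay for changing chains
        next_states, probs = zip(*chains[i][states[i]])
        states[i] = rng.choices(next_states, weights=probs)[0]
        total += 1.0  # unit cost for advancing chain i
        current = i
    return total

# Two deterministic toy chains: A needs two moves, B needs one.
chain_a = {0: [(1, 1.0)], 1: [(2, 1.0)]}
chain_b = {0: [(1, 1.0)]}
targets = [2, 1]

stay = lambda states, cur: 0 if cur is None else cur           # never switch
alternate = lambda states, cur: 0 if cur is None else 1 - cur  # switch every round

cost_stay = simulate([chain_a, chain_b], targets, stay, switch_cost=3.0)
cost_alternate = simulate([chain_a, chain_b], targets, alternate, switch_cost=3.0)
```

In this toy instance with $k=1$, staying on chain A pays only for its two moves, while alternating pays for one move on each chain plus one switch, so the switching cost alone decides which schedule is cheaper.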