We study a general Markov game with metric switching costs: in each round, the player adaptively chooses one of several Markov chains to advance, with the objective of minimizing the expected cost for at least $k$ chains to reach their target states. Whenever the player switches to a different chain, an additional switching cost is incurred. The special case with no switching cost was solved optimally by Dumitriu, Tetali, and Winkler [DTW03] via a variant of the celebrated Gittins index for the classical multi-armed bandit (MAB) problem with Markovian rewards [Gittins74, Gittins79]. However, for MAB with nontrivial switching costs, even when the switching cost is a constant, the classic paper by Banks and Sundaram [BS94] showed that no index strategy can be optimal. In this paper, we complement their result and show that a simple index strategy achieves a constant approximation factor when the switching cost is constant and $k=1$. To the best of our knowledge, this is the first index strategy that achieves a constant approximation factor for a general MAB variant with switching costs. For general metric switching costs, we propose a more involved constant-factor approximation algorithm, via a nontrivial reduction to the stochastic $k$-TSP problem, in which each Markov chain is approximated by a random variable. Our analysis makes extensive use of various interesting properties of the Gittins index.