In the regret-based formulation of Multi-armed Bandit (MAB) problems, except in rare instances, much of the literature focuses on arms with i.i.d. rewards. In this paper, we consider the problem of obtaining regret guarantees for MAB problems in which the rewards of each arm form a Markov chain that may not belong to a single-parameter exponential family. Achieving logarithmic regret in such problems is not difficult: a variation of the standard Kullback-Leibler Upper Confidence Bound (KL-UCB) algorithm does the job. However, the constants obtained from such an analysis are poor, for the following reason: i.i.d. rewards are a special case of Markov rewards, and it is difficult to design an algorithm that works well independently of whether the underlying model is truly Markovian or i.i.d. To overcome this issue, we introduce a novel algorithm that identifies whether the rewards from each arm are truly Markovian or i.i.d. using a total variation distance-based test. Our algorithm then switches from the standard KL-UCB to a specialized version of KL-UCB when it determines that the arm's rewards are Markovian, thus resulting in low regret in both the i.i.d. and Markovian settings.
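To make the mechanism described above concrete, here is a minimal illustrative sketch, not the paper's exact procedure: a heuristic total-variation test that flags an arm as Markovian when the empirical next-reward distribution conditioned on the current reward deviates from the empirical marginal, together with a standard Bernoulli KL-UCB index computed by bisection. The function names, the particular test statistic, and the threshold choice are assumptions made purely for illustration.

```python
# Illustrative sketch only; the actual test statistic, threshold, and the
# specialized Markovian KL-UCB index used in the paper may differ.

import math
from collections import Counter, defaultdict


def tv_distance(p, q, support):
    """Total variation distance between two distributions over `support`."""
    return 0.5 * sum(abs(p.get(s, 0.0) - q.get(s, 0.0)) for s in support)


def looks_markovian(rewards, threshold=0.1):
    """Heuristic test: under i.i.d. rewards, every row of the empirical
    transition matrix should be close in TV distance to the empirical
    marginal; a large deviation suggests genuine Markovian dependence."""
    n = len(rewards)
    if n < 2:
        return False
    support = set(rewards)
    marginal = {s: c / n for s, c in Counter(rewards).items()}
    trans = defaultdict(Counter)          # empirical transition counts
    for x, y in zip(rewards[:-1], rewards[1:]):
        trans[x][y] += 1
    max_dev = 0.0
    for row in trans.values():
        total = sum(row.values())
        row_dist = {y: c / total for y, c in row.items()}
        max_dev = max(max_dev, tv_distance(row_dist, marginal, support))
    return max_dev > threshold


def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def kl_ucb_index(mean, pulls, t, c=0.0):
    """Bernoulli KL-UCB index: the largest q >= mean satisfying
    pulls * KL(mean, q) <= log(t) + c * log(log(t)), found by bisection."""
    if pulls == 0:
        return 1.0
    budget = math.log(max(t, 2)) + c * math.log(math.log(max(t, 3)))
    lo, hi = mean, 1.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if pulls * kl_bernoulli(mean, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo
```

In this sketch, an arm whose reward history passes `looks_markovian` would be handed to a Markov-aware index; otherwise the ordinary i.i.d. KL-UCB index above is used. The threshold of 0.1 is arbitrary here; in practice it would need to shrink with the sample size to control false detections.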