We study a multi-armed bandit problem in which the rewards exhibit regime switching. Specifically, the distributions of the random rewards generated by all arms are modulated by a common underlying state modeled as a finite-state Markov chain. The agent does not observe the underlying state and has to learn the transition matrix and the reward distributions. We propose a learning algorithm for this problem, building on spectral method-of-moments estimation for hidden Markov models, belief error control in partially observable Markov decision processes, and upper-confidence-bound methods for online learning. We also establish a regret upper bound of $O(T^{2/3}\sqrt{\log T})$ for the proposed learning algorithm, where $T$ is the learning horizon. Finally, we conduct proof-of-concept experiments to illustrate the performance of the learning algorithm.
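To make the problem setup concrete, the following is a minimal sketch (in Python with NumPy) of a bandit environment whose arm reward distributions are all modulated by one common hidden Markov state. The state/arm counts, transition matrix, reward means, and class name are illustrative assumptions for exposition only, not the paper's experimental configuration, and the sketch shows only the environment dynamics, not the proposed learning algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumptions: 2 hidden states, 3 arms, Bernoulli rewards.
num_states, num_arms = 2, 3
P = np.array([[0.95, 0.05],        # transition matrix of the hidden Markov chain
              [0.10, 0.90]])
mu = np.array([[0.9, 0.5, 0.1],    # mean reward of each arm in hidden state 0
               [0.1, 0.5, 0.9]])   # mean reward of each arm in hidden state 1

class RegimeSwitchingBandit:
    """All arms share one underlying state; the agent never observes it."""

    def __init__(self, P, mu, rng):
        self.P, self.mu, self.rng = P, mu, rng
        self.state = rng.integers(P.shape[0])  # unobserved initial state

    def pull(self, arm):
        # Reward distribution of the chosen arm depends on the hidden state.
        reward = self.rng.binomial(1, self.mu[self.state, arm])
        # The common underlying state evolves as a Markov chain,
        # independently of which arm was pulled.
        self.state = self.rng.choice(self.P.shape[0], p=self.P[self.state])
        return reward

env = RegimeSwitchingBandit(P, mu, rng)
rewards = [env.pull(rng.integers(num_arms)) for _ in range(10)]  # uniform random play
print(rewards)
```

A learning algorithm for this environment would have to estimate $P$ and the reward distributions from the observed rewards alone (e.g., via spectral method-of-moments estimation), track a belief over the hidden state, and trade off exploration and exploitation, which is what the proposed algorithm combines.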