In this paper we present a model for the hidden Markovian bandit problem with linear rewards. In contrast to current work on Markovian bandits, we do not assume that the state is known to the decision maker before the decision is made. Furthermore, we assume structural side information: the decision maker knows in advance that there are two types of hidden states, one common to all arms that evolves according to a Markovian distribution, and one unique to each arm that is drawn from an arm-specific i.i.d. process. We present an algorithm and a regret analysis for this problem. Surprisingly, we can recover the hidden states and maintain logarithmic regret when the action set is a convex polytope. Furthermore, we show that the structural side information leads to expected regret that does not depend on the number of extreme points of the action space. We therefore obtain practical solutions even for high-dimensional problems.
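To make the model concrete, the following is a minimal simulation sketch of the reward structure described above. All sizes, parameter values, and the random placeholder policy are illustrative assumptions, not taken from the paper: a common hidden state evolves as a Markov chain, each arm has its own i.i.d. hidden component, and the observed reward is linear in these hidden quantities.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical problem instance (dimensions and values are assumptions).
n_arms, horizon = 3, 5
P = np.array([[0.9, 0.1],              # transition matrix of the common hidden state
              [0.2, 0.8]])
common_theta = np.array([1.0, -1.0])   # reward contribution of each common state
arm_means = np.array([0.2, 0.5, 0.8])  # means of the per-arm i.i.d. hidden components

s = 0  # common hidden state; unobserved by the decision maker
for t in range(horizon):
    z = rng.normal(arm_means, 0.1)     # per-arm i.i.d. hidden components, redrawn each round
    a = rng.integers(n_arms)           # placeholder policy: play a uniformly random arm
    reward = common_theta[s] + z[a]    # reward is linear in the hidden states
    s = rng.choice(2, p=P[s])          # common state evolves according to the Markov chain
```

The learner observes only `reward`, never `s` or `z`; the paper's contribution is an algorithm that exploits the structural side information (common Markovian state plus arm-specific i.i.d. states) in place of the random policy sketched here.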