In this paper, we consider a multi-armed bandit in which each arm is a Markov process evolving on a finite state space. The state space is common across the arms, and the arms are independent of each other. The transition probability matrix of one of the arms (the odd arm) is different from the common transition probability matrix of all the other arms. A decision maker, who knows these transition probability matrices, wishes to identify the odd arm as quickly as possible while keeping the probability of decision error small. To do so, the decision maker collects observations by pulling the arms sequentially, one arm at each discrete time instant. However, the decision maker has a trembling hand: with a small probability, the arm that is actually pulled at any given time differs from the one intended. The observation at any given time consists of the arm that is actually pulled and its current state. The Markov processes of the unobserved arms continue to evolve, which makes the arms restless. For this setting, we derive the first known asymptotic lower bound on the expected time required to identify the odd arm, where the asymptotics is in the regime of vanishing error probability. The continued evolution of each arm adds a new dimension to the problem, leading to a family of Markov decision problems (MDPs) on a countable state space. We then stitch together certain parameterised solutions to these MDPs and obtain a sequence of strategies whose expected times to identify the odd arm come arbitrarily close to the lower bound in the regime of vanishing error probability. Prior works dealt with arms that are independent and identically distributed across time and with rested Markov arms, whereas our work deals with restless Markov arms.
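To make the observation model concrete, the following is a minimal Python sketch of the setting described above. It is an illustration under stated assumptions, not the strategy or analysis of the paper. In particular, the names `simulate`, `step`, and `eta` are hypothetical, the trembling hand is assumed to pick uniformly among the other arms with probability `eta`, the initial states are assumed uniform, and the round-robin choice of the intended arm is only a placeholder for a real sampling strategy.

```python
import random


def step(row_probs, rng):
    """Sample the next state from one row of a transition probability matrix."""
    return rng.choices(range(len(row_probs)), weights=row_probs)[0]


def simulate(P_odd, P_common, odd_arm, num_arms, eta, horizon, seed=0):
    """Return a list of (arm actually pulled, observed state of that arm).

    Assumptions (not taken from the paper): uniform initial states, a
    round-robin placeholder policy, and a trembling hand that switches to a
    uniformly chosen different arm with probability eta.
    """
    rng = random.Random(seed)
    num_states = len(P_common)
    states = [rng.randrange(num_states) for _ in range(num_arms)]
    observations = []
    for t in range(horizon):
        intended = t % num_arms  # placeholder policy, not an optimal one
        if rng.random() < eta:
            # Trembling hand: a different arm is pulled by mistake.
            pulled = rng.choice([a for a in range(num_arms) if a != intended])
        else:
            pulled = intended
        # Restless arms: every arm makes a transition, observed or not.
        for a in range(num_arms):
            P = P_odd if a == odd_arm else P_common
            states[a] = step(P[states[a]], rng)
        # Only the pulled arm and its current state are revealed.
        observations.append((pulled, states[pulled]))
    return observations


if __name__ == "__main__":
    P_common = [[0.9, 0.1], [0.2, 0.8]]  # illustrative matrices only
    P_odd = [[0.5, 0.5], [0.5, 0.5]]
    for arm, state in simulate(P_odd, P_common, odd_arm=1, num_arms=3,
                               eta=0.1, horizon=5):
        print(arm, state)
```

Because unobserved arms keep evolving, the time elapsed since each arm was last observed matters for inference, which is consistent with the countable state space of the MDPs mentioned above.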