We consider a class of restless bandit problems that finds broad application in stochastic optimization, reinforcement learning, and operations research. In our model, there are $N$ independent $2$-state Markov processes that may be observed and accessed to accrue rewards. The observation is error-prone, i.e., both false alarms and miss detections may occur. Furthermore, the user can only choose a subset of $M~(M<N)$ processes to observe at each discrete time. If a process in state~$1$ is correctly observed, it offers some reward. Due to the partial and imperfect observation model, the system is formulated as a restless multi-armed bandit problem with an information state space of uncountable cardinality. Restless bandit problems with finite state spaces are PSPACE-hard in general. In this paper, we develop a low-complexity algorithm that achieves strong performance for this class of restless bandits. Under certain conditions, we theoretically prove the existence (indexability) of the Whittle index and the equivalence of the resulting index policy to our algorithm. When those conditions do not hold, we show by numerical experiments that our algorithm remains near-optimal in general.
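To make the information-state dynamics concrete, consider a single process and let $\omega_t$ denote the conditional probability that it is in state~$1$ at time $t$; the notation below ($p_{01}$, $p_{11}$ for the transition probabilities, $\epsilon$ for the false-alarm probability, $\delta$ for the miss-detection probability) is illustrative and not taken from the abstract. A standard Bayesian update for such a model reads
\[
\omega_{t+1} =
\begin{cases}
\dfrac{\omega_t(1-\delta)}{\omega_t(1-\delta)+(1-\omega_t)\epsilon}\,p_{11} + \dfrac{(1-\omega_t)\epsilon}{\omega_t(1-\delta)+(1-\omega_t)\epsilon}\,p_{01}, & \text{observed, observation }1,\\[2ex]
\dfrac{\omega_t\,\delta}{\omega_t\delta+(1-\omega_t)(1-\epsilon)}\,p_{11} + \dfrac{(1-\omega_t)(1-\epsilon)}{\omega_t\delta+(1-\omega_t)(1-\epsilon)}\,p_{01}, & \text{observed, observation }0,\\[2ex]
\omega_t\,p_{11} + (1-\omega_t)\,p_{01}, & \text{not observed.}
\end{cases}
\]
Since $\omega_t$ evolves over a continuum and each observation outcome branches the update differently, the reachable information states form an uncountable set, which is why finite-state restless bandit techniques do not directly apply here.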