We consider a class of restless bandit problems that finds broad application in stochastic optimization, reinforcement learning, and operations research. We consider $N$ independent discrete-time Markov processes, each of which has two possible states: 1 and 0 (`good' and `bad'). A reward accrues only when a process is in state 1 and is observed to be so. The aim is to maximize the expected discounted sum of rewards over the infinite horizon, subject to the constraint that only $M$ $(<N)$ processes may be observed at each time step. Observation is error-prone: there are known probabilities that state 1 (0) will be observed as 0 (1). From these, one can compute, at any time $t$, the probability that process $i$ is in state 1. The resulting system may be modeled as a restless multi-armed bandit problem with an information state space of uncountable cardinality. Restless bandit problems are PSPACE-hard in general, even with finite state spaces. We propose a novel approach for simplifying the dynamic programming equations of this class of restless bandits and develop a low-complexity algorithm that achieves strong performance and is readily extensible to the general restless bandit model with observation errors. Under certain conditions, we establish the existence (indexability) of the Whittle index and its equivalence to our algorithm. When those conditions do not hold, numerical experiments demonstrate the near-optimal performance of our algorithm over the general parameter space. Finally, we prove the optimality of our algorithm for homogeneous systems.
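To make the information-state dynamics concrete, the sketch below illustrates a belief update under error-prone observations for a single two-state Markov process, together with a simple selection of $M$ arms by largest belief. The transition probabilities `p11`, `p01`, the miss probability `delta`, the false-alarm probability `eps`, and the myopic selection rule are illustrative assumptions for this sketch; they are not the paper's proposed index algorithm.

```python
import numpy as np

def belief_update(omega, observed, obs_as_one, p11, p01, delta, eps):
    """One-step update of the belief that an arm is in state 1 (a sketch).

    omega      : prior probability that the arm is in state 1
    observed   : whether the arm was selected for observation this slot
    obs_as_one : if observed, whether the (error-prone) observation read '1'
    p11, p01   : assumed transition probabilities Pr(1 -> 1) and Pr(0 -> 1)
    delta, eps : assumed Pr(observe 0 | state 1) and Pr(observe 1 | state 0)
    """
    if observed:
        if obs_as_one:
            # Bayes posterior of being in state 1 given an observation of '1'
            post = omega * (1 - delta) / (omega * (1 - delta) + (1 - omega) * eps)
        else:
            # Bayes posterior given an observation of '0'
            post = omega * delta / (omega * delta + (1 - omega) * (1 - eps))
    else:
        # Unobserved arms keep their prior belief
        post = omega
    # Propagate the posterior through the two-state Markov chain
    return post * p11 + (1 - post) * p01

def select_arms(beliefs, M):
    """Pick the M arms with the largest current belief.

    This myopic rule is only a stand-in index; the paper's algorithm
    (and the Whittle index) would replace the ranking used here.
    """
    return np.argsort(beliefs)[-M:]

# Example: N = 5 arms, observe M = 2 per slot (hypothetical parameters)
rng = np.random.default_rng(0)
beliefs = rng.uniform(size=5)
chosen = select_arms(beliefs, 2)
```

Under these assumptions, iterating the update for observed and unobserved arms generates the uncountable information state space referred to above: each arm's belief evolves deterministically when unobserved and jumps according to the noisy observation when selected.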