This paper studies a class of constrained restless multi-armed bandits (CRMAB). The constraints are in the form of time varying set of actions (set of available arms). This variation can be either stochastic or semi-deterministic. Given a set of arms, a fixed number of them can be chosen to be played in each decision interval. The play of each arm yields a state dependent reward. The current states of arms are partially observable through binary feedback signals from arms that are played. The current availability of arms is fully observable. The objective is to maximize long term cumulative reward. The uncertainty about future availability of arms along with partial state information makes this objective challenging. Applications for CRMAB can be found in resource allocation in cyber-physical systems involving components with time varying availability. First, this optimization problem is analyzed using Whittle's index policy. To this end, a constrained restless single-armed bandit is studied. It is shown to admit a threshold-type optimal policy and is also indexable. An algorithm to compute Whittle's index is presented. An alternate solution method with lower complexity is also presented in the form of an online rollout policy. A detailed discussion on the complexity of both these schemes is also presented, which suggests that online rollout policy with short look ahead is simpler to implement than Whittle's index computation. Further, upper bounds on the value function are derived in order to estimate the degree of sub-optimality of various solutions. The simulation study compares the performance of Whittle's index, online rollout, myopic and modified Whittle's index policies.
翻译:本文研究一组受限制的不固定多武装匪徒( CRMAB) 。 限制的形式是时间差异化的行动( 可用武器组) 。 这种差异可以是随机的, 也可以是半确定性的。 根据一套武器组, 可以在每个决定间隔中选择固定数量。 每个手臂的玩耍可以产生一个取决于状态的奖赏。 目前的武器状态通过所玩武器的二进制反馈信号来部分观察。 目前的武器供应情况是完全可见的。 目标是最大限度地增加长期累积奖励。 与部分状态信息一起, 未来武器供应的不确定性使得这个目标具有挑战性。 这种差异既可以是随机的,也可以是半决定性的。 在涉及时间差异的组件的网络物理系统中, 可以找到对 CRMAB 的应用程序。 首先, 利用惠特尔的指数政策政策来分析优化问题。 为此, 将研究一个不固定的单臂强的受限制的单臂强力。 这表明可以接受一种最优的门槛式最佳政策, 并且也可以完全地观察。 计算惠特的指数的算法 。 一种复杂程度的替代的解决方案方法, 也可以在网上推算方法的形式, 。 在在线的推算方法上, 的推算方法的精度的精度的精度的精度计算。