We study the problem of planning in restless multi-armed bandits (RMABs) with multiple actions. This is a popular model for multi-agent systems, with applications such as multi-channel communication, monitoring and machine maintenance tasks, and healthcare. Whittle index policies, which are based on Lagrangian relaxations, are widely used in these settings due to their simplicity and near-optimality under certain conditions. In this work, we first show that Whittle index policies can fail in simple and practically relevant RMAB settings, even when the RMABs are indexable. We discuss why the optimality guarantees fail and why asymptotic optimality may not translate well to practically relevant planning horizons. We then propose an alternative planning algorithm based on the mean-field method, which provably and efficiently obtains near-optimal policies for RMABs with a large number of arms, without the stringent structural assumptions required by Whittle index policies. Our approach builds on ideas from existing research, with several improvements: it is hyper-parameter free, and we provide an improved non-asymptotic analysis featuring: (a) no requirement for exogenous hyper-parameters and tighter polynomial dependence on known problem parameters; (b) high-probability bounds showing that the reward of the policy is reliable; and (c) matching sub-optimality lower bounds for this algorithm with respect to the number of arms, demonstrating the tightness of our bounds. Our extensive experimental analysis shows that the mean-field approach matches or outperforms other baselines.