对Whittle来说,指数性并不够:为无休止的强盗改进、近于最佳的比值 (Indexability is Not Enough for Whittle: Improved, Near-Optimal Algorithms for Restless Bandits)

We study the problem of planning restless multi-armed bandits (RMABs) with multiple actions. This is a popular model for multi-agent systems with applications like multi-channel communication, monitoring and machine maintenance tasks, and healthcare. Whittle index policies, which are based on Lagrangian relaxations, are widely used in these settings due to their simplicity and near-optimality under certain conditions. In this work, we first show that Whittle index policies can fail in simple and practically relevant RMAB settings, \textit{even when} the RMABs are indexable. We discuss why the optimality guarantees fail and why asymptotic optimality may not translate well to practically relevant planning horizons. We then propose an alternate planning algorithm based on the mean-field method, which can provably and efficiently obtain near-optimal policies with a large number of arms, without the stringent structural assumptions required by the Whittle index policies. This borrows ideas from existing research with some improvements: our approach is hyper-parameter free, and we provide an improved non-asymptotic analysis which has: (a) no requirement for exogenous hyper-parameters and tighter polynomial dependence on known problem parameters; (b) high probability bounds which show that the reward of the policy is reliable; and (c) matching sub-optimality lower bounds for this algorithm with respect to the number of arms, thus demonstrating the tightness of our bounds. Our extensive experimental analysis shows that the mean-field approach matches or outperforms other baselines.

翻译：我们研究的是利用多种行动规划无休止的多武装匪徒(RMABs)的问题。这是一个多试剂系统流行的模式,其应用方式包括多渠道通信、监测和机器维护任务以及医疗保健。基于拉格兰加放松的惠特指数政策,在某些条件下由于简单和接近最佳,在这些环境中被广泛使用。在这项工作中,我们首先显示惠特尔指数政策在简单和实际相关的RMAB设置中可能失败,即使RMABs可以索引化。我们讨论了为什么最佳性能保证失败,以及为什么无止境的最佳性可能无法转化为实际相关的规划前景。我们然后提出了基于中位法的替代规划算法,这种算法可以灵活和高效地获得大量武器接近最佳的政策,而没有惠特尔指数政策所要求的严格结构假设。这借用了现有研究中的一些想法:我们的方法是超均位标准,我们提供了更好的非痛苦性分析,这种分析已经:(一个已知的低位基准比比值,表明我们更精确的比值是更精确的(一个已知的)实验性比值,这个比值的比值是更精确的比值。(一个已知的比值的比值是更精确的比值,这个比值是更精确的比值的比值是更精确的比值是更精确的比值,这个比值是更精确的比值,这个比值是更精确的比值是更精确的比值的比值。) 。