We study the Lagrangian Index Policy (LIP) for restless multi-armed bandits with the long-run average reward criterion. In particular, we compare the performance of LIP with that of the Whittle Index Policy (WIP), both heuristic policies known to be asymptotically optimal under natural conditions. While their performances are very similar in most cases, LIP continues to perform well in the cases where WIP performs poorly. We then propose reinforcement learning algorithms, both tabular and NN-based, that yield online learning schemes for LIP in the model-free setting. The proposed reinforcement learning schemes for LIP require significantly less memory than the analogous schemes for WIP. We calculate the Lagrangian index analytically for the restart model, which applies to optimal web crawling and to minimizing the weighted age of information. Finally, we give a new proof of asymptotic optimality in the case of homogeneous arms, as the number of arms goes to infinity, based on exchangeability and de Finetti's theorem.