This paper studies restless multi-armed bandit (RMAB) problems in which the arm transition dynamics are unknown but correlated arm features are known. The goal is to learn a model that predicts transition dynamics from features; the Whittle index policy then solves the RMAB problem using the predicted transitions. However, prior work often trains the model to maximize predictive accuracy rather than the quality of the final RMAB solution, causing a mismatch between training and evaluation objectives. To address this shortcoming, we propose a novel approach for decision-focused learning in RMAB that directly trains the predictive model to maximize the Whittle index solution quality. We present three key contributions: (i) we establish the differentiability of the Whittle index policy to support decision-focused learning; (ii) we significantly improve the scalability of previous decision-focused learning approaches in sequential problems; (iii) we apply our algorithm to the service call scheduling problem in a real-world maternal and child health domain. Our algorithm is the first for decision-focused learning in RMAB that scales to large real-world problems. \end{abstract}