Multi-action restless multi-armed bandits (RMABs) are a powerful framework for constrained resource allocation in which $N$ independent processes are managed. However, prior work has only studied the offline setting, where problem dynamics are known. We address this restrictive assumption, designing the first algorithms for learning good policies for multi-action RMABs online using combinations of Lagrangian relaxation and Q-learning. Our first approach, MAIQL, extends a method for Q-learning the Whittle index in binary-action RMABs to the multi-action setting. We derive a generalized update rule and convergence proof and establish that, under standard assumptions, MAIQL converges to the asymptotically optimal multi-action RMAB policy as $t\rightarrow{}\infty$. However, MAIQL relies on learning Q-functions and indices on two timescales, which leads to slow convergence and requires problem structure to perform well. Thus, we design a second algorithm, LPQL, which learns the well-performing and more general Lagrange policy for multi-action RMABs by learning to minimize the Lagrange bound through a variant of Q-learning. To ensure fast convergence, we take an approximation strategy that enables learning on a single timescale, then give a guarantee relating the approximation's precision to an upper bound on LPQL's return as $t\rightarrow{}\infty$. Finally, we show that our approaches always outperform baselines across multiple settings, including one derived from real-world medication adherence data.
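To make the "Lagrangian relaxation plus Q-learning" combination concrete, the following is a minimal single-arm sketch, not the paper's exact MAIQL or LPQL procedure: tabular Q-learning on the Lagrange-penalized reward $r - \lambda\, c(a)$, with a simple subgradient step on $\lambda$ to push the arm toward its budget share. The state/action sizes, costs, step sizes, and placeholder dynamics (`env_step`) are all illustrative assumptions.

```python
import numpy as np

# Illustrative sketch only: one decoupled arm of a multi-action RMAB.
n_states, n_actions = 5, 3                     # assumed per-arm sizes
action_costs = np.array([0.0, 1.0, 2.0])       # assumed per-action costs
budget_share = 0.5                             # assumed per-arm budget share B/N

Q = np.zeros((n_states, n_actions))
lam = 0.0                                      # Lagrange multiplier (kept >= 0)
alpha, eta, gamma, eps = 0.1, 0.01, 0.95, 0.1  # assumed step sizes / discount

def env_step(state, action):
    """Stand-in for the unknown arm dynamics: returns (reward, next_state)."""
    next_state = np.random.randint(n_states)
    reward = float(state == n_states - 1)
    return reward, next_state

s = 0
for t in range(10_000):
    # epsilon-greedy action on the current Q-estimates
    a = (np.random.randint(n_actions) if np.random.rand() < eps
         else int(np.argmax(Q[s])))
    r, s_next = env_step(s, a)

    # Q-learning update on the decoupled, Lagrange-penalized reward.
    target = (r - lam * action_costs[a]) + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

    # Subgradient step on lam: increase it when the arm overspends its budget share.
    lam = max(0.0, lam + eta * (action_costs[a] - budget_share))
    s = s_next
```

In the paper's terms, MAIQL instead learns per-action indices on a second, slower timescale, while LPQL approximates the minimization of the Lagrange bound over $\lambda$ so that everything runs on a single timescale; the sketch above only conveys the shared Lagrange-penalized Q-learning core.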