We study optimal resource allocation in restless multi-armed bandits (RMABs) under unknown and non-stationary dynamics. Solving RMABs optimally is PSPACE-hard even with full knowledge of the model parameters, and while the Whittle index policy offers asymptotic optimality at low computational cost, it requires access to stationary transition kernels, an unrealistic assumption in many applications. To address this challenge, we propose a Sliding-Window Online Whittle (SW-Whittle) policy that remains computationally efficient while adapting to time-varying kernels. Our algorithm achieves a dynamic regret of $\tilde O(T^{2/3}\tilde V^{1/3}+T^{4/5})$ for large RMABs, where $T$ is the number of episodes and $\tilde V$ is the variation budget, i.e., the cumulative total-variation distance between consecutive transition kernels. Importantly, we handle the challenging case where the variation budget is unknown in advance by combining a Bandit-over-Bandit framework with our sliding-window design. Window lengths are tuned online as a function of the estimated variation, while Whittle indices are computed from upper confidence bounds on the estimated transition kernels via a bilinear optimization routine. Numerical experiments demonstrate that our algorithm consistently outperforms baselines, achieving the lowest cumulative regret across a range of non-stationary environments.
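For concreteness, below is a minimal, self-contained sketch of two ingredients named above: a sliding-window empirical estimate of an arm's transition kernel with a Hoeffding-style upper-confidence bonus, and a Whittle index obtained by binary search over the passivity subsidy for a single arm. This is an illustrative assumption-laden sketch, not the paper's implementation: the names (`SlidingWindowEstimator`, `whittle_index`), the binary search standing in for the paper's bilinear optimization routine, the indexability assumption, and all constants are hypothetical, and the Bandit-over-Bandit tuning of the window length is omitted.

```python
# Hypothetical sketch of sliding-window kernel estimation with a UCB bonus
# and a Whittle index via binary search; not the paper's actual algorithm.
import numpy as np
from collections import deque

class SlidingWindowEstimator:
    """Empirical transition kernel over the last `window` transitions,
    plus a Hoeffding-style confidence bonus per (state, action) pair."""
    def __init__(self, n_states, n_actions, window):
        self.n_states, self.n_actions = n_states, n_actions
        self.buffer = deque(maxlen=window)  # stores (s, a, s') tuples

    def update(self, s, a, s_next):
        self.buffer.append((s, a, s_next))

    def estimate(self, delta=0.05):
        counts = np.zeros((self.n_states, self.n_actions, self.n_states))
        for s, a, s_next in self.buffer:
            counts[s, a, s_next] += 1
        n_sa = counts.sum(axis=2, keepdims=True)
        # Fall back to a uniform kernel for unvisited (s, a) pairs.
        p_hat = np.where(n_sa > 0, counts / np.maximum(n_sa, 1),
                         1.0 / self.n_states)
        bonus = np.sqrt(np.log(2 / delta) / (2 * np.maximum(n_sa, 1)))
        return p_hat, bonus.squeeze(-1)

def whittle_index(P, r, s, gamma=0.95, lo=-1.0, hi=1.0, iters=50):
    """Binary search for the subsidy lambda that makes the passive action
    (a=0) and the active action (a=1) equally attractive in state s,
    assuming the arm is indexable. P: (S, 2, S) kernel, r: (S, 2) rewards."""
    S = P.shape[0]
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        # Value iteration for the lambda-subsidized single-arm MDP.
        V = np.zeros(S)
        for _ in range(500):
            Q0 = r[:, 0] + lam + gamma * P[:, 0, :] @ V  # passive + subsidy
            Q1 = r[:, 1] + gamma * P[:, 1, :] @ V        # active
            V_new = np.maximum(Q0, Q1)
            if np.max(np.abs(V_new - V)) < 1e-8:
                break
            V = V_new
        # If active is still preferred in s, the subsidy is too small.
        if Q1[s] > Q0[s]:
            lo = lam
        else:
            hi = lam
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    est = SlidingWindowEstimator(n_states=2, n_actions=2, window=200)
    for _ in range(500):  # synthetic transitions, just to exercise the API
        est.update(rng.integers(2), rng.integers(2), rng.integers(2))
    p_hat, bonus = est.estimate()
    r = np.array([[0.0, -0.1], [1.0, 0.9]])  # hypothetical rewards
    print(whittle_index(p_hat, r, s=0))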