Conventional Multi-Armed Bandit (MAB) algorithms are designed for stationary environments, where the reward distributions associated with the arms do not change with time. In many applications, however, the environment is more accurately modeled as being non-stationary. In this work, piecewise stationary MAB (PS-MAB) environments are investigated, in which the reward distributions associated with a subset of the arms change at some change-points and remain stationary between change-points. Our focus is on the asymptotic analysis of PS-MABs, for which practical algorithms based on change detection have been previously proposed. Our goal is to modularize the design and analysis of such Detection Augmented Bandit (DAB) procedures. To this end, we first provide novel, improved performance lower bounds for PS-MABs. Then, we identify the requirements for stationary bandit algorithms and change detectors in a DAB procedure that are needed for the modularization. We assume that the rewards are sub-Gaussian. Under this assumption and a condition on the separation of the change-points, we show that the analysis of DAB procedures can indeed be modularized, so that the regret bounds can be obtained in a unified manner for various combinations of change detectors and bandit algorithms. Through this analysis, we develop new modular DAB procedures that are order-optimal. Finally, we showcase the practical effectiveness of our modular DAB approach in our experiments, studying its regret performance compared to other methods and investigating its detection capabilities.
翻译:传统的多臂老虎机(MAB)算法是为平稳环境设计的,其中各臂对应的奖励分布不随时间变化。然而,在许多实际应用中,环境更准确地建模为非平稳的。本文研究了分段平稳MAB(PS-MAB)环境,其中部分臂对应的奖励分布在特定变化点发生改变,并在变化点之间保持平稳。我们聚焦于PS-MAB的渐近分析,此前已有基于变化检测的实用算法被提出。我们的目标是将此类检测增强型老虎机(DAB)算法的设计与分析模块化。为此,我们首先为PS-MAB提供了新颖且改进的性能下界。接着,我们明确了在模块化DAB算法中,平稳老虎机算法与变化检测器所需满足的条件。我们假设奖励服从亚高斯分布。在此假设及变化点间隔条件的约束下,我们证明了DAB算法的分析确实可以模块化,从而能够以统一方式为不同变化检测器与老虎机算法的组合获得遗憾界。通过该分析,我们开发了新的模块化DAB算法,这些算法在阶数意义下是最优的。最后,我们在实验中展示了模块化DAB方法的实际有效性,通过与其他方法比较其遗憾性能并探究其检测能力。