Multi-armed bandit (MAB) problems are mainly studied under two extreme settings known as stochastic and adversarial. These two settings, however, do not capture realistic environments such as search engines, marketing, and advertising, in which rewards change stochastically over time. Motivated by this, we introduce and study a dynamic MAB problem with stochastic temporal structure, where the expected reward of each arm is governed by an auto-regressive (AR) model. Due to the dynamic nature of the rewards, simple "explore-then-commit" policies fail, as all arms have to be explored continuously over time. We formalize this by characterizing a per-round regret lower bound, where the regret is measured against a strong (dynamic) benchmark. We then present an algorithm whose per-round regret nearly matches our lower bound. Our algorithm relies on two mechanisms: (i) alternating between recently pulled arms and unpulled arms with potential, and (ii) restarting. These mechanisms enable the algorithm to dynamically adapt to changes and discard irrelevant past information at a suitable rate. In numerical studies, we further demonstrate the strength of our algorithm under different types of non-stationary settings.
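To make the setting concrete, the sketch below simulates a bandit whose expected rewards evolve according to independent AR(1) processes and runs a crude staleness-based policy that alternates between exploiting the best recently observed arm and re-exploring arms that have gone unpulled for too long. This is an illustrative assumption-laden caricature of the two mechanisms described above, not the paper's actual algorithm; all parameter values (the AR coefficient, noise levels, and the re-exploration window) are hypothetical.

```python
import numpy as np

# Minimal simulation of a K-armed bandit whose expected rewards follow
# independent AR(1) processes, paired with a simple staleness-based policy.
# All names and parameter values are illustrative assumptions, not the
# paper's actual model or algorithm.

rng = np.random.default_rng(0)

K, T = 5, 2000          # number of arms, horizon
gamma = 0.9             # AR(1) coefficient (assumed common across arms)
sigma_process = 0.1     # innovation noise of the AR reward process
sigma_obs = 0.1         # observation noise on pulled arms
window = 50             # re-explore any arm not pulled within this many rounds

mu = rng.normal(0.0, 1.0, size=K)        # latent expected rewards
last_obs = np.zeros(K)                   # last observed reward per arm
last_pull = np.full(K, -10**9)           # round of each arm's last pull

regret = 0.0
for t in range(T):
    # AR(1) evolution of every arm's expected reward.
    mu = gamma * mu + rng.normal(0.0, sigma_process, size=K)

    # Staleness-aware index: recently observed arms use their last sample;
    # arms not pulled within the window are treated as "unpulled with
    # potential" and get an optimistic index, forcing them back into the
    # rotation (a crude stand-in for the alternate-and-restart mechanisms).
    stale = (t - last_pull) >= window
    index = np.where(stale, np.inf, last_obs)
    arm = int(np.argmax(index))

    reward = mu[arm] + rng.normal(0.0, sigma_obs)
    last_obs[arm] = reward
    last_pull[arm] = t
    regret += mu.max() - mu[arm]          # regret against the dynamic benchmark

print(f"average per-round regret: {regret / T:.3f}")
```

Because the best arm drifts over time, this simulation also shows why any fixed "explore-then-commit" rule incurs per-round regret that does not vanish: past observations lose relevance at a rate governed by the AR coefficient, so continuous re-exploration is unavoidable.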