Multi-armed bandit (MAB) problems are mainly studied under two extreme settings known as stochastic and adversarial. These two settings, however, do not capture realistic environments such as search engines, marketing, and advertising, in which rewards change stochastically over time. Motivated by this, we introduce and study a dynamic MAB problem with stochastic temporal structure, where the expected reward of each arm is governed by an auto-regressive (AR) model. Due to the dynamic nature of the rewards, simple "explore and commit" policies fail, as all arms must be explored continuously over time. We formalize this by characterizing a per-round regret lower bound, where the regret is measured against a strong (dynamic) benchmark. We then present an algorithm whose per-round regret nearly matches this lower bound. Our algorithm relies on two mechanisms: (i) alternating between recently pulled arms and unpulled arms with potential, and (ii) restarting. These mechanisms enable the algorithm to adapt dynamically to changes and to discard irrelevant past information at a suitable rate. Numerical studies further demonstrate the strength of our algorithm in non-stationary settings.
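To make the reward dynamics concrete, a first-order instance of such an auto-regressive structure could be written as below; here $\mu_t(i)$ denotes the expected reward of arm $i$ at round $t$, and the coefficient $\gamma$ and noise term $\varepsilon_t(i)$ are illustrative placeholders rather than the paper's exact notation:
$$
\mu_{t}(i) \;=\; \gamma\,\mu_{t-1}(i) \;+\; \varepsilon_{t}(i), \qquad \varepsilon_{t}(i)\ \text{zero-mean i.i.d. noise.}
$$
Under dynamics of this kind, an estimate of $\mu_t(i)$ becomes stale unless arm $i$ is pulled again, which is why every arm must be explored continuously and why restarting helps discard outdated observations.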