There is rising interest in industrial online applications where data becomes available sequentially. Inspired by the recommendation of playlists to users, whose preferences can be collected while they listen to the entire playlist, we study a novel bandit setting, namely the Multi-Armed Bandit with Temporally-Partitioned Rewards (TP-MAB), in which the stochastic reward associated with the pull of an arm is partitioned over a finite number of consecutive rounds following the pull. This setting, unexplored so far to the best of our knowledge, is a natural extension of delayed-feedback bandits to the case in which rewards may be spread over a finite time span after the pull instead of being fully disclosed in a single, potentially delayed round. We provide two algorithms to address TP-MAB problems, namely TP-UCB-FR and TP-UCB-EW, which exploit the partial information disclosed by the reward as it is collected over time. We show that our algorithms provide better asymptotic regret upper bounds than delayed-feedback bandit algorithms when a property characterizing a broad set of reward structures of practical interest, namely α-smoothness, holds. We also empirically evaluate their performance across a wide range of settings, both synthetically generated and from a real-world media recommendation problem.
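To make the setting concrete, the following is a minimal Python sketch of a TP-MAB reward process, assuming a uniform partition of each pull's cumulative reward over the rounds that follow it. The class and parameter names (TPMABArm, tau_max) are hypothetical illustrations introduced here, not the paper's notation, and the sketch implements only the environment, not the TP-UCB-FR or TP-UCB-EW algorithms.

```python
import numpy as np

rng = np.random.default_rng(0)

class TPMABArm:
    """Toy TP-MAB arm: each pull draws a stochastic cumulative reward
    and partitions it over the tau_max rounds following the pull, so
    the learner observes it piece by piece rather than all at once."""

    def __init__(self, mean, tau_max):
        self.mean = mean          # expected cumulative reward in [0, 1]
        self.tau_max = tau_max    # reward is spread over tau_max rounds

    def pull(self):
        # Cumulative reward of this pull (a Beta draw is one simple
        # choice; the setting allows general bounded reward structures).
        total = rng.beta(10 * self.mean, 10 * (1 - self.mean))
        # Partition the total uniformly over the next tau_max rounds.
        # Under alpha-smoothness, the reward falling in each of alpha
        # equally sized groups of consecutive rounds is bounded; a
        # uniform split trivially satisfies such a constraint.
        return np.full(self.tau_max, total / self.tau_max)

# Usage: observe one pull's partial rewards round by round.
arm = TPMABArm(mean=0.7, tau_max=5)
for delay, r in enumerate(arm.pull(), start=1):
    print(f"round t+{delay}: observed partial reward {r:.3f}")
```

The uniform split is only for illustration: the point of the model is that partial observations arrive before the pull's full reward is known, which is exactly the information a TP-MAB algorithm can exploit and a delayed-feedback algorithm discards.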