In this paper, we investigate the Multi-Armed Bandit problem with Temporally-Partitioned Rewards (TP-MAB). In the TP-MAB setting, an agent receives subsets of an arm's reward over multiple rounds rather than the entire reward at once. We introduce a general formulation of how an arm's cumulative reward is distributed across rounds, called the Beta-spread property. This generalization is needed to handle partitioned rewards in which the maximum per-round reward is not distributed uniformly across rounds. We derive a lower bound on the regret of the TP-MAB problem under the assumption that the Beta-spread property holds. Moreover, we provide an algorithm, TP-UCB-FR-G, which uses the Beta-spread property to improve the regret upper bound in some scenarios. By generalizing how the cumulative reward is distributed, this setting applies to a broader range of applications.
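To make the setting concrete, the following is a minimal sketch of the temporally-partitioned reward mechanism described above. All names (PartitionedArm, run) and the uniform per-round split are illustrative assumptions for exposition only; they are not the paper's formal Beta-spread definition or the TP-UCB-FR-G algorithm.

```python
import random

class PartitionedArm:
    """An arm whose cumulative reward is paid out over `tau` rounds."""

    def __init__(self, mean, tau):
        self.mean = mean  # expected cumulative reward of one pull
        self.tau = tau    # number of rounds the payout is spread over

    def pull(self):
        """Return the per-round reward pieces generated by one pull.

        Here the cumulative reward is split uniformly for simplicity;
        under the Beta-spread property the per-round maximum reward
        need not be uniform across rounds.
        """
        total = random.uniform(0, 2 * self.mean)
        return [total / self.tau] * self.tau


def run(horizon, arms):
    """Play arms round-robin and collect temporally-partitioned rewards."""
    pending = []      # outstanding payout streams, one list per past pull
    collected = 0.0
    for t in range(horizon):
        arm = arms[t % len(arms)]
        pending.append(list(arm.pull()))
        # Each round, the agent observes only the next piece of every
        # outstanding pull, not the full cumulative reward.
        still_open = []
        for pieces in pending:
            collected += pieces.pop(0)
            if pieces:
                still_open.append(pieces)
        pending = still_open
    return collected


if __name__ == "__main__":
    arms = [PartitionedArm(mean=0.5, tau=3), PartitionedArm(mean=0.8, tau=5)]
    print(run(horizon=100, arms=arms))
```

In this sketch, a learner only ever sees partial feedback each round, which is why estimators and confidence bounds for TP-MAB must account for how the remaining reward mass can arrive in later rounds.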