We study the multi-armed bandit (MAB) problem with composite and anonymous feedback. In this model, the reward of pulling an arm spreads over a period of time (we call this period the reward interval), and the player successively receives partial rewards of the action, convoluted with the rewards of other arms pulled in the meantime. Existing results on this model require prior knowledge of the reward interval size as an input to their algorithms. In this paper, we propose adaptive algorithms for both the stochastic and the adversarial cases that do not require any prior information about the reward interval. For the stochastic case, we prove that our algorithm guarantees a regret that matches the lower bound in order. For the adversarial case, we propose the first algorithm that jointly handles a non-oblivious adversary and an unknown reward interval size. We also conduct simulations based on a real-world dataset; the results show that our algorithms outperform existing benchmarks.
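To make the feedback model concrete, the following is a minimal simulation sketch (not from the paper): the arm means, the Dirichlet spread of each reward over the interval, and the uniformly random policy are all illustrative assumptions, used only to show how composite and anonymous feedback is generated and observed.

```python
import numpy as np

# Sketch of composite and anonymous feedback: pulling an arm at round t
# generates a reward that is split across the next d rounds (the "reward
# interval"); at each round the player observes only the sum of the pieces
# arriving at that round, with no attribution to the arms that produced them.

rng = np.random.default_rng(0)

K = 3        # number of arms (illustrative)
T = 20       # horizon (illustrative)
d = 4        # reward interval size -- unknown to the player in this paper
means = np.array([0.3, 0.5, 0.8])   # hypothetical arm means

pending = np.zeros(T + d)            # reward pieces scheduled for future rounds

for t in range(T):
    arm = rng.integers(K)                        # placeholder policy: uniform random pulls
    reward = rng.binomial(1, means[arm])         # total reward of this pull
    weights = rng.dirichlet(np.ones(d))          # one possible way to spread it over d rounds
    pending[t:t + d] += reward * weights
    observed = pending[t]                        # aggregate, anonymous feedback at round t
    print(f"round {t}: pulled arm {arm}, observed composite feedback {observed:.3f}")
```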