We introduce a multi-armed bandit model where the reward is a sum of multiple random variables, and each action alters the distributions of only some of them. After each action, the agent observes the realizations of all the variables. This model is motivated by marketing campaigns and recommender systems, where the variables represent outcomes on individual customers, such as clicks. We propose UCB-style algorithms that estimate the uplifts of the actions over a baseline. We study multiple variants of the problem, including when the baseline and the affected variables are unknown, and prove sublinear regret bounds for all of them. We also provide lower bounds that justify the necessity of our modeling assumptions. Experiments on synthetic and real-world datasets show the benefit of methods that estimate the uplifts over policies that ignore this structure.
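To make the setting concrete, below is a minimal sketch of a UCB-style uplift bandit consistent with the model described above, not the paper's actual algorithm. The synthetic instance, all names (`pull`, `baseline`, `affected`), and the choice of confidence width are illustrative assumptions; in particular, the width here scales with the size of an action's affected subset rather than with the total number of variables, which is the structural advantage the abstract alludes to.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 50           # number of outcome variables (e.g. individual customers)
n_actions = 5
T = 5000

# Hypothetical synthetic instance: Bernoulli variables with baseline means;
# each action shifts the means of a small, known subset of 5 variables.
baseline = rng.uniform(0.2, 0.5, size=K)
affected = [rng.choice(K, size=5, replace=False) for _ in range(n_actions)]
shift = rng.uniform(-0.1, 0.2, size=(n_actions, 5))

def pull(a):
    """Draw all K variables; action a shifts only its affected subset."""
    means = baseline.copy()
    means[affected[a]] = np.clip(means[affected[a]] + shift[a], 0, 1)
    return rng.binomial(1, means)

# UCB on uplifts: per action, track empirical means of its affected variables
# under that action. Baseline means are estimated from every round, because
# variables outside the pulled action's subset always follow the baseline.
counts = np.zeros(n_actions)
sum_affected = [np.zeros(5) for _ in range(n_actions)]
base_sum = np.zeros(K)
base_cnt = np.zeros(K)

for t in range(1, T + 1):
    if t <= n_actions:
        a = t - 1                       # pull each action once to initialize
    else:
        ucb = np.empty(n_actions)
        for b in range(n_actions):
            # Uplift estimate: sum over affected variables of
            # (mean under action b) - (baseline mean).
            mu_b = sum_affected[b] / counts[b]
            mu_0 = base_sum[affected[b]] / np.maximum(base_cnt[affected[b]], 1)
            # Confidence width scales with the affected-subset size, not K.
            bonus = len(affected[b]) * np.sqrt(2 * np.log(t) / counts[b])
            ucb[b] = (mu_b - mu_0).sum() + bonus
        a = int(np.argmax(ucb))

    x = pull(a)
    counts[a] += 1
    sum_affected[a] += x[affected[a]]
    # Variables not affected by the pulled action are free baseline samples.
    mask = np.ones(K, dtype=bool)
    mask[affected[a]] = False
    base_sum[mask] += x[mask]
    base_cnt[mask] += 1

# The action with the largest true uplift should dominate the pull counts.
print("pull counts per action:", counts.astype(int))
```

The key design choice this sketch illustrates is that every round yields baseline samples for all unaffected variables, so the baseline is learned essentially for free, and the per-action uncertainty is confined to its small affected subset.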