In many real-world sequential decision-making problems, the effect of an action is not immediately reflected in the feedback; instead, it spreads over a long time frame. For instance, in online advertising, investing in a platform increases awareness, but the actual reward, i.e., a conversion, might occur far in the future. Furthermore, whether a conversion takes place depends on how fast awareness grows, how quickly it vanishes, and the synergy or interference with other advertising platforms. Previous work has investigated the Multi-Armed Bandit framework with delayed and aggregated feedback, but without imposing a particular structure on how an action propagates into the future, thus disregarding possible dynamical effects. In this paper, we introduce a novel setting, the Dynamical Linear Bandits (DLB), an extension of linear bandits characterized by a hidden state. When an action is performed, the learner observes a noisy reward whose mean is a linear function of the hidden state and of the action. Then, the hidden state evolves according to a linear dynamics that is also affected by the performed action. We start by introducing the setting, discussing the notion of optimal policy, and deriving a lower bound on the expected regret. Then, we provide an anytime optimistic regret minimization algorithm, Dynamical Linear Upper Confidence Bound (DynLin-UCB), that suffers an expected regret of order O(c d sqrt(T)), where c is a constant depending on the properties of the linear dynamical evolution and d is the dimension of the action vector. Finally, we conduct a numerical validation on a synthetic environment and on real-world data to show the effectiveness of DynLin-UCB in comparison with several baselines.
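To make the interaction protocol described above concrete, the following is a minimal sketch of a DLB-style environment, assuming a hidden state that evolves as h_{t+1} = A h_t + B a_t and a noisy reward whose mean is linear in both the action and the hidden state. The specific names (A, B, theta, omega), dimensions, and the placeholder policy are illustrative assumptions, not the paper's notation or the DynLin-UCB algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 4                      # action dimension d, hidden-state dimension n (arbitrary choices)
A = 0.8 * np.eye(n)              # stable linear dynamics (spectral radius < 1), an assumption
B = rng.standard_normal((n, d))  # how the action feeds into the hidden state
theta = rng.standard_normal(d)   # direct (instantaneous) reward parameter on the action
omega = rng.standard_normal(n)   # reward parameter on the hidden state
sigma = 0.1                      # reward noise level

h = np.zeros(n)                  # hidden state, never observed by the learner
T = 1000
for t in range(T):
    a = rng.standard_normal(d)   # placeholder policy; DynLin-UCB would choose a optimistically
    a /= np.linalg.norm(a)       # keep actions on the unit ball
    # Observed feedback: noisy reward, linear in the action and in the hidden state.
    r = theta @ a + omega @ h + sigma * rng.standard_normal()
    # Hidden state evolves according to linear dynamics, affected by the performed action.
    h = A @ h + B @ a
```

In this sketch, the constant c in the regret bound would relate to how slowly the effect of an action dissipates through the dynamics matrix A: the closer its spectral radius is to one, the longer an action keeps influencing future rewards.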