Nonstationary phenomena, such as satiation effects in recommendation, are a common feature of sequential decision-making problems. While these phenomena have been mostly studied in the framework of bandits with finitely many arms, in many practically relevant cases linear bandits provide a more effective modeling choice. In this work, we introduce a general framework for the study of nonstationary linear bandits, where current rewards are influenced by the learner's past actions in a fixed-size window. In particular, our model includes stationary linear bandits as a special case. After showing that the best sequence of actions is NP-hard to compute in our model, we focus on cyclic policies and prove a regret bound for a variant of the OFUL algorithm that balances approximation and estimation errors. Our theoretical findings are supported by experiments (which also include misspecified settings) where our algorithm is seen to perform well against natural baselines.
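The abstract does not include any implementation details, so the following is only a rough, illustrative sketch of the kind of setting it describes: an OFUL/LinUCB-style learner facing a toy satiation effect, where an arm's expected reward shrinks with the number of times it was played in the last m rounds. The window size m, feature dimension, arm set, exploration bonus, and the multiplicative satiation factor are all assumptions made for illustration and are not the paper's actual model or algorithm.

```python
# Illustrative sketch only: a LinUCB/OFUL-style learner on a toy environment
# with a satiation-style effect (reward shrinks with recent plays of the same
# arm). All constants below are placeholder choices, not the paper's setup.
import numpy as np

rng = np.random.default_rng(0)
d, K, T, m = 5, 20, 2000, 10            # dimension, #arms, horizon, window size
arms = rng.normal(size=(K, d))
arms /= np.linalg.norm(arms, axis=1, keepdims=True)
theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)

lam, beta = 1.0, 2.0                     # ridge parameter and exploration bonus
V = lam * np.eye(d)                      # regularized design (Gram) matrix
b = np.zeros(d)                          # running sum of reward-weighted features
history = []                             # indices of past actions

for t in range(T):
    V_inv = np.linalg.inv(V)
    theta_hat = V_inv @ b
    # optimistic index: estimated reward plus a confidence width per arm
    widths = np.sqrt(np.einsum("ki,ij,kj->k", arms, V_inv, arms))
    a = int(np.argmax(arms @ theta_hat + beta * widths))

    # toy satiation effect: reward scaled down by plays of arm a in the last m rounds
    recent_plays = history[-m:].count(a)
    mean_reward = (arms[a] @ theta_star) * (0.5 ** recent_plays)
    reward = mean_reward + 0.1 * rng.normal()

    # standard ridge-regression updates for the linear reward estimate
    V += np.outer(arms[a], arms[a])
    b += reward * arms[a]
    history.append(a)
```

Because the environment's rewards depend on the recent action history, a learner that ignores this dependence (as the sketch above does) can keep re-selecting a satiated arm; this is the kind of failure mode that motivates the windowed nonstationary model and the cyclic policies discussed in the abstract.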