Autoregressive processes naturally arise in a large variety of real-world scenarios, including, e.g., stock markets, sales forecasting, weather prediction, advertising, and pricing. When addressing a sequential decision-making problem in such a context, the temporal dependence between consecutive observations should be properly accounted for to converge to the optimal decision policy. In this work, we propose a novel online learning setting, named Autoregressive Bandits (ARBs), in which the observed reward follows an autoregressive process of order $k$, whose parameters depend on the action the agent chooses, within a finite set of $n$ actions. Then, we devise an optimistic regret minimization algorithm, AutoRegressive Upper Confidence Bound (AR-UCB), that suffers regret of order $\widetilde{\mathcal{O}} \left( \frac{(k+1)^{3/2}\sqrt{nT}}{(1-\Gamma)^2} \right)$, where $T$ is the optimization horizon and $\Gamma < 1$ is an index of the stability of the system. Finally, we present a numerical validation in several synthetic settings and one real-world setting, in comparison with general- and specific-purpose bandit baselines, showing the advantages of the proposed approach.
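To make the ARB reward model concrete, the following is a minimal simulation sketch, assuming the AR($k$) form $X_t = \gamma_0(a_t) + \sum_{i=1}^{k} \gamma_i(a_t) X_{t-i} + \xi_t$ with action-dependent parameters and a stability condition $\sum_{i=1}^{k} |\gamma_i(a)| \le \Gamma < 1$ for every action; all names (`gamma`, `step`, the noise scale) are illustrative choices, not the paper's implementation.

```python
import numpy as np

# Hypothetical sketch of an ARB environment: each action a has AR(k)
# coefficients gamma[a] = (gamma_0, ..., gamma_k), and the next reward is
#   X_t = gamma_0(a) + sum_{i=1..k} gamma_i(a) * X_{t-i} + noise.
rng = np.random.default_rng(0)
n, k = 3, 2                                      # number of actions, AR order
gamma = rng.uniform(0.0, 0.4, size=(n, k + 1))   # random stable coefficients
# Stability: the AR weights of every action must sum (in absolute value) below 1.
assert np.all(np.abs(gamma[:, 1:]).sum(axis=1) < 1), "each action must be stable"

history = np.zeros(k)  # last k rewards, most recent first

def step(action: int) -> float:
    """Draw the next reward for `action` and shift the AR history."""
    global history
    reward = gamma[action, 0] + gamma[action, 1:] @ history + rng.normal(0.0, 0.1)
    history = np.concatenate(([reward], history[:-1]))
    return reward

# Example: pull arbitrary actions for a few rounds.
for t in range(5):
    print(step(int(rng.integers(n))))
```

Because the reward at time $t$ depends on the $k$ previous rewards regardless of which actions produced them, actions have delayed effects, which is the feature distinguishing ARBs from standard stochastic bandits.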