In the $\mathcal{X}$-armed bandit problem, an agent sequentially interacts with an environment that yields a reward based on the vector-valued action the agent provides. The agent's goal is to maximise the sum of these rewards over a given number of time steps. The problem and its variations have been the subject of numerous studies, which propose strategies with sub-linear, and sometimes optimal, regret. This paper introduces a novel variation of the problem: we consider an environment that can abruptly change its behaviour an unknown number of times. For this setting we propose a novel strategy and prove that it attains sub-linear cumulative regret. Moreover, when the relation between an action and the corresponding reward is highly smooth, the method is nearly optimal. The theoretical results are supported by an experimental study.
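To make the interaction protocol concrete, the following is a minimal, purely illustrative sketch (not the paper's algorithm): a one-dimensional continuum-armed bandit whose mean-reward function changes abruptly at unknown change points, played by a naive fixed-discretization epsilon-greedy agent while cumulative pseudo-regret is tracked. The environment, the change points, and the agent are all hypothetical choices made for this example.

```python
import random

def mean_reward(x, phase):
    # Piecewise-stationary environment: the optimum jumps when the phase changes.
    peak = [0.2, 0.8][phase % 2]   # hypothetical abrupt change of behaviour
    return 1.0 - abs(x - peak)     # smooth (Lipschitz) mean-reward function

def run(horizon=2000, change_points=(700, 1400), n_arms=20, eps=0.1, seed=0):
    rng = random.Random(seed)
    arms = [(i + 0.5) / n_arms for i in range(n_arms)]  # uniform grid on [0, 1]
    counts = [0] * n_arms
    values = [0.0] * n_arms
    regret = 0.0
    phase = 0
    for t in range(horizon):
        if t in change_points:
            phase += 1                                   # abrupt, unannounced change
        if rng.random() < eps or min(counts) == 0:
            a = rng.randrange(n_arms)                    # explore
        else:
            a = max(range(n_arms), key=lambda i: values[i])  # exploit
        x = arms[a]
        r = mean_reward(x, phase) + rng.gauss(0.0, 0.1)  # noisy observed reward
        counts[a] += 1
        values[a] += (r - values[a]) / counts[a]         # running-mean estimate
        best = max(mean_reward(z, phase) for z in arms)  # best arm on the grid
        regret += best - mean_reward(x, phase)           # cumulative pseudo-regret
    return regret

print(run())
```

Because the epsilon-greedy agent never adapts to the change points, its regret grows linearly after each change; a strategy of the kind the abstract describes would instead keep the cumulative regret sub-linear in the horizon.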