We study the non-stationary stochastic multi-armed bandit problem, where the reward statistics of each arm may change several times during the course of learning. The performance of a learning algorithm is evaluated in terms of its dynamic regret, defined as the difference between the expected cumulative reward of an agent that selects the optimal arm in every round and the expected cumulative reward of the learning algorithm. One way to measure the hardness of such environments is to count how many times the identity of the optimal arm changes. We propose a method that achieves, in $K$-armed bandit problems, a near-optimal $\widetilde O(\sqrt{K N(S+1)})$ dynamic regret, where $N$ is the number of rounds and $S$ is the number of times the identity of the optimal arm changes, without prior knowledge of either $S$ or $N$. Previous work on this problem obtains regret bounds that scale with the number of changes (or the amount of change) in the reward functions, which can be much larger than $S$, or assumes prior knowledge of $S$ to achieve similar bounds.
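To make the performance measure concrete, here is a minimal formalization, with notation introduced for illustration rather than taken from the paper: let $\mu_t(a)$ denote the mean reward of arm $a \in \{1,\dots,K\}$ in round $t$, let $a_t^\star \in \arg\max_a \mu_t(a)$ be an optimal arm in round $t$, and let $A_t$ be the arm selected by the learner. The dynamic regret over $N$ rounds and the number of optimal-arm switches $S$ can then be written as
\[
  R_N \;=\; \sum_{t=1}^{N} \mu_t(a_t^\star) \;-\; \mathbb{E}\!\left[\sum_{t=1}^{N} \mu_t(A_t)\right],
  \qquad
  S \;=\; \sum_{t=2}^{N} \mathbb{1}\{a_t^\star \neq a_{t-1}^\star\}.
\]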