A standard assumption adopted in the multi-armed bandit (MAB) framework is that the mean rewards are constant over time. This assumption can be restrictive in the business world, as decision-makers often face an evolving environment in which the mean rewards are time-varying. In this paper, we consider a non-stationary MAB model with $K$ arms whose mean rewards vary over time in a periodic manner. The unknown periods can differ across arms and can scale polynomially with the horizon length $T$. We propose a two-stage policy that combines Fourier analysis with a confidence-bound-based learning procedure to learn the periods and minimize the regret. In stage one, the policy correctly estimates the periods of all arms with high probability. In stage two, the policy explores the periodic mean rewards of the arms using the periods estimated in stage one and exploits the optimal arm in the long run. We show that our learning policy incurs a regret upper bound of $\tilde{O}(\sqrt{T\sum_{k=1}^K T_k})$, where $T_k$ is the period of arm $k$. Moreover, we establish a general lower bound of $\Omega(\sqrt{T\max_{k}\{ T_k\}})$ for any policy. Therefore, our policy is near-optimal up to a factor of $\sqrt{K}$.
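To make the Fourier-based period-estimation step concrete, the following is a minimal illustrative sketch, not the paper's exact stage-one procedure: it estimates a single arm's period by locating the dominant peak of the periodogram of its observed rewards. The function name `estimate_period`, the period-range parameters, and the sinusoidal toy arm are all hypothetical choices for illustration.

```python
import numpy as np

def estimate_period(rewards, min_period=2, max_period=None):
    """Illustrative sketch: estimate an arm's period from noisy reward
    observations by picking the dominant frequency of the periodogram."""
    rewards = np.asarray(rewards, dtype=float)
    n = len(rewards)
    if max_period is None:
        max_period = n // 2
    centered = rewards - rewards.mean()            # remove the mean (DC) component
    power = np.abs(np.fft.rfft(centered)) ** 2     # periodogram of the observations
    freqs = np.fft.rfftfreq(n)                     # frequencies in cycles per time step
    periods = np.full_like(freqs, np.inf)
    periods[1:] = 1.0 / freqs[1:]                  # skip the zero frequency
    mask = (periods >= min_period) & (periods <= max_period)
    best = np.argmax(np.where(mask, power, -np.inf))
    return int(round(periods[best]))

# Toy usage: one arm with period 7 plus Gaussian reward noise (hypothetical data).
rng = np.random.default_rng(0)
t = np.arange(5000)
obs = 0.5 + 0.3 * np.sin(2 * np.pi * t / 7) + 0.1 * rng.standard_normal(t.size)
print(estimate_period(obs))   # typically prints 7
```

In stage two, one would then treat each estimated period $T_k$ as known and run a confidence-bound exploration over the $T_k$ per-phase mean rewards of each arm; the sketch above only covers the period-recovery step.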