Motivated by emerging applications such as live-streaming e-commerce, promotions, and recommendations, we introduce and solve a general class of non-stationary multi-armed bandit problems with the following two features: (i) the decision maker can pull and collect rewards from up to $K\,(\ge 1)$ out of $N$ different arms in each time period; (ii) the expected reward of an arm drops immediately after it is pulled, and then recovers non-parametrically as the arm's idle time increases. With the objective of maximizing the expected cumulative reward over $T$ time periods, we design a class of ``Purely Periodic Policies'' that jointly set a pulling period for each arm. For the proposed policies, we prove performance guarantees for both the offline and online problems. For the offline problem, where all model parameters are known, the proposed periodic policy attains an approximation ratio of $1-\mathcal{O}(1/\sqrt{K})$, which is asymptotically optimal as $K$ grows to infinity. For the online problem, where the model parameters are unknown and must be learned dynamically, we integrate the offline periodic policy with the upper confidence bound (UCB) procedure to construct an online policy. The proposed online policy is proved to have approximately $\widetilde{\mathcal{O}}(N\sqrt{T})$ regret against the offline benchmark. Our framework and policy design may shed light on broader offline planning and online learning applications with non-stationary and recovering rewards.
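To make the setting concrete, below is a minimal simulation sketch of the recovering-reward model under a purely periodic schedule. The saturating recovery curves, the per-arm periods, the offsets, and all numerical constants are illustrative assumptions chosen for exposition; they are not the paper's optimized construction or analysis.

```python
# Sketch: N arms, pull up to K per period; an arm's expected reward depends only
# on its idle time since the last pull (nondecreasing "recovery"). A purely
# periodic policy pulls arm i every d[i] periods at a fixed offset, staggered so
# that at most K arms are scheduled in any single period.
import numpy as np

rng = np.random.default_rng(0)
N, K, T = 6, 2, 1000                       # arms, pulls per period, horizon

# Hypothetical recovery curves R_i(tau): expected reward after tau idle periods.
caps = rng.uniform(0.5, 1.0, size=N)       # long-run reward levels (assumption)
rates = rng.uniform(0.1, 0.5, size=N)      # recovery speeds (assumption)

def expected_reward(i, tau):
    """Saturating stand-in for the unknown nonparametric recovery of arm i."""
    return caps[i] * (1.0 - np.exp(-rates[i] * tau))

# Hand-picked periods and offsets (not the paper's optimized schedule): arms
# 0-2 share period 3, arms 3-5 share period 6, offsets staggered to respect K.
d = np.array([3, 3, 3, 6, 6, 6])
offset = np.array([0, 1, 2, 3, 4, 5]) % d

idle = np.zeros(N)                         # periods since each arm was last pulled
total = 0.0
for t in range(T):
    scheduled = [i for i in range(N) if t % d[i] == offset[i]][:K]
    for i in scheduled:
        total += expected_reward(i, idle[i])
        idle[i] = 0                        # pulling resets the idle time
    for i in range(N):
        if i not in scheduled:
            idle[i] += 1                   # idle arms keep recovering

print(f"average per-period reward: {total / T:.3f}")
```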