Motivated by emerging applications such as live-streaming e-commerce, promotions, and recommendations, we introduce a general class of multi-armed bandit problems with the following two features: (i) the decision maker can pull and collect rewards from at most $K$ out of $N$ different arms in each time period; (ii) the expected reward of an arm drops immediately after it is pulled and then recovers non-parametrically as its idle time increases. With the objective of maximizing the expected cumulative reward over $T$ time periods, we propose, construct, and prove performance guarantees for a class of "Purely Periodic Policies". For the offline problem, where all model parameters are known, our proposed policy achieves an approximation ratio on the order of $1-\mathcal O(1/\sqrt{K})$, which is asymptotically optimal as $K$ grows to infinity. For the online problem, where the model parameters are unknown and must be learned, we design an Upper Confidence Bound (UCB) based policy that incurs approximately $\widetilde{\mathcal O}(N\sqrt{T})$ regret against the offline benchmark. Our framework and policy design may be adapted to other offline planning and online learning applications with non-stationary and recovering rewards.
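As a minimal illustration of the setting described above, the sketch below simulates $N$ arms whose expected rewards recover with idle time and a fixed "purely periodic" schedule that pulls at most $K$ arms per period. The exponential recovery curves (`r_max`, `tau`) and the specific periods and offsets are illustrative assumptions for this toy example; they are not the paper's actual recovery model or schedule construction.

```python
import numpy as np

# Toy simulator: N arms, at most K pulls per period, and each arm's expected
# reward depends only on the idle time since its last pull (recovering rewards).
# Recovery curves and the periodic schedule below are illustrative assumptions.

rng = np.random.default_rng(0)
N, K, T = 6, 2, 1000

# Assumed recovery curve: r_max[i] * (1 - exp(-idle / tau[i])),
# i.e., the reward climbs back toward r_max[i] as the arm rests.
r_max = rng.uniform(0.5, 1.0, size=N)
tau = rng.uniform(1.0, 5.0, size=N)

def expected_reward(arm, idle):
    """Expected reward of `arm` when pulled after `idle` periods of rest."""
    return r_max[arm] * (1.0 - np.exp(-idle / tau[arm]))

# A purely periodic schedule: arm i is pulled every period[i] steps with a
# phase offset chosen so that no more than K arms are pulled in any period.
period = np.array([3, 3, 3, 6, 6, 6])
offset = np.array([0, 1, 2, 3, 4, 5])

idle = np.full(N, 100.0)   # start all arms fully rested
total_reward = 0.0
for t in range(T):
    pulls = [i for i in range(N)
             if t >= offset[i] and (t - offset[i]) % period[i] == 0]
    assert len(pulls) <= K   # the schedule respects the per-period pull budget
    for i in pulls:
        total_reward += expected_reward(i, idle[i])
        idle[i] = 0.0
    idle += 1.0

print(f"average per-period reward: {total_reward / T:.3f}")
```

Varying `period` and `offset` (while keeping at most $K$ pulls per period) changes the trade-off between pulling an arm often and letting its reward recover, which is the tension the purely periodic policies in the paper are designed to balance.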