We consider the infinitely many-armed bandit problem with rotting rewards, in which the mean reward of an arm decreases with each pull according to an arbitrary trend with maximum rotting rate $\varrho=o(1)$. We show that this learning problem has a worst-case regret lower bound of $\Omega(\max\{\varrho^{1/3}T,\sqrt{T}\})$, where $T$ is the time horizon. We then show that a matching upper bound $\tilde{O}(\max\{\varrho^{1/3}T,\sqrt{T}\})$, up to a poly-logarithmic factor, is achieved by an algorithm that maintains a UCB index for each arm and compares it against a threshold to decide whether to keep pulling the arm or remove it from further consideration, provided the algorithm knows the maximum rotting rate $\varrho$. Finally, we show that a regret upper bound of $\tilde{O}(\max\{\varrho^{1/3}T,T^{3/4}\})$ is achievable by an algorithm that does not know $\varrho$, using an adaptive UCB index along with an adaptive threshold.
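To make the UCB-index-with-threshold idea concrete, the following is a minimal Python sketch of such a policy, not the paper's exact algorithm: the uniform $U[0,1]$ reservoir of initial means, the Gaussian reward noise, the use of the full per-arm history in the index, and the threshold margin $\delta=\max\{\varrho^{1/3},1/\sqrt{T}\}$ (motivated by the regret bound, with untuned constants) are all illustrative assumptions.

```python
import math
import random

def ucb_threshold_bandit(T, varrho, sigma=1.0, seed=0):
    """Hedged sketch of a UCB-index / threshold policy for the rotting
    infinitely many-armed bandit. Constants, the reservoir distribution,
    and the exact index are illustrative assumptions, not tuned values."""
    rng = random.Random(seed)
    # Illustrative threshold: discard an arm once its UCB index falls more
    # than delta below the best achievable mean (assumed here to be 1).
    delta = max(varrho ** (1.0 / 3.0), 1.0 / math.sqrt(T))
    total_reward = 0.0
    t = 0
    while t < T:
        mu = rng.random()            # fresh arm's initial mean from U[0, 1]
        pulls, reward_sum = 0, 0.0
        while t < T:
            # Observe a noisy reward; the mean rots by at most varrho per pull.
            r = mu + rng.gauss(0.0, sigma)
            mu = max(0.0, mu - rng.uniform(0.0, varrho))
            pulls += 1
            reward_sum += r
            total_reward += r
            t += 1
            # UCB index from the empirical mean of this arm's own pulls.
            ucb = reward_sum / pulls + math.sqrt(2.0 * math.log(T) / pulls)
            if ucb < 1.0 - delta:    # index fell below threshold: drop the arm
                break
    return total_reward
```

The design point the sketch illustrates is that, since each arm rots by at most $\varrho$ per pull, an arm whose index stays above the threshold remains near-optimal to keep pulling, while a sub-threshold arm can be discarded in favor of a fresh draw from the infinite reservoir.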