We generalize the multiple-play multi-armed bandits (MP-MAB) problem to a shareable arm setting, in which several plays can share the same arm. Furthermore, each shareable arm has a finite reward capacity and a "per-load" reward distribution, both of which are unknown to the learner. The reward from a shareable arm is load-dependent: it equals the "per-load" reward multiplied by the number of plays pulling the arm, or by the arm's reward capacity when the number of plays exceeds that capacity. When the "per-load" reward follows a Gaussian distribution, we prove a sample complexity lower bound for learning the capacity from load-dependent rewards, as well as a regret lower bound for this new MP-MAB problem. We devise a capacity estimator whose sample complexity upper bound matches the lower bound in terms of reward means and capacities. We also propose an online learning algorithm to address the problem and prove its regret upper bound. The first term of this regret upper bound matches that of the regret lower bound, and its second and third terms also correspond to those of the lower bound. Extensive experiments validate our algorithm's performance and its gain in 5G & 4G base station selection.
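To make the load-dependent reward concrete, the following is a minimal sketch of the reward model described above, using assumed notation (the per-load reward draw $X_k$, number of plays $a_k$, and capacity $m_k$ are placeholder symbols, not names taken from the abstract):

$$
R_k \;=\; X_k \cdot \min\{a_k,\, m_k\},
$$

where $X_k$ is a sample from arm $k$'s unknown "per-load" reward distribution, $a_k$ is the number of plays assigned to arm $k$, and $m_k$ is the arm's unknown reward capacity; the reward thus scales linearly with the load until the capacity is reached, after which it saturates.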