We consider the problem of sequentially allocating resources in a censored semi-bandits setup, where the learner allocates resources to the arms at each step and observes the resulting loss. The loss depends on two hidden parameters: one is specific to the arm but independent of the resource allocation, while the other depends on the allocated resource. More specifically, the loss for an arm equals zero if the resource allocated to it exceeds a constant (but unknown) arm-dependent threshold. The goal is to learn a resource allocation that minimizes the expected loss. The problem is challenging because the loss distribution and the threshold value of each arm are unknown. We study this setting by establishing its `equivalence' to Multiple-Play Multi-Armed Bandits (MP-MAB) and Combinatorial Semi-Bandits. Exploiting these equivalences, we derive optimal algorithms for our problem setting from known algorithms for MP-MAB and Combinatorial Semi-Bandits. Experiments on synthetically generated data validate the performance guarantees of the proposed algorithms.
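The censored feedback described above can be illustrated with a minimal simulation sketch. All names and parameter values here (`mu`, `theta`, the Bernoulli loss model) are assumptions for illustration only, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# K arms, each with a hidden mean loss mu[i] and a hidden
# resource threshold theta[i] (both unknown to the learner).
K = 3
mu = np.array([0.8, 0.5, 0.3])      # assumed per-arm mean losses
theta = np.array([0.6, 0.3, 0.9])   # assumed per-arm thresholds

def observed_losses(allocation):
    """Per-arm losses for one round under a resource allocation.

    An arm incurs zero loss when its allocated resource exceeds its
    threshold; otherwise a Bernoulli(mu[i]) loss is drawn.
    """
    raw = rng.binomial(1, mu)        # latent Bernoulli losses
    censored = allocation > theta    # arms whose loss is censored to zero
    return np.where(censored, 0, raw)

# Allocating enough resource to arms 0 and 1 censors their losses;
# arm 2 still incurs a random loss.
losses = observed_losses(np.array([0.7, 0.4, 0.2]))
```

The learner only sees `losses`, so it must disentangle each arm's loss distribution from its threshold using allocation choices alone.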