We study the sequential resource allocation problem where a decision maker repeatedly allocates budgets between resources. Motivating examples include allocating limited computing time or wireless spectrum bands to multiple users (i.e., resources). At each timestep, the decision maker should distribute its available budgets among different resources to maximize the expected reward, or equivalently to minimize the cumulative regret. In doing so, the decision maker should learn the value of the resources allocated for each user from feedback on each user's received reward. For example, users may send messages of different urgency over wireless spectrum bands; the reward generated by allocating spectrum to a user then depends on the message's urgency. We assume each user's reward follows a random process that is initially unknown. We design combinatorial multi-armed bandit algorithms to solve this problem with discrete or continuous budgets. We prove the proposed algorithms achieve logarithmic regrets under semi-bandit feedback.
翻译:我们研究决策者反复在资源之间分配预算的顺序资源分配问题。 吸引人的例子包括将有限的计算时间或无线频谱带分配给多个用户( 即资源 ) 。 每次时间步骤时, 决策者应将其可用预算分配给不同的资源, 以尽量获得预期的回报, 或同等地将累积的遗憾降到最低 。 这样做时, 决策者应该从每个用户收到的奖励的反馈中了解分配给每个用户的资源的价值。 例如, 用户可以发送无线频谱带的不同紧急信息; 将频谱分配给用户所产生的奖励则取决于信息的迫切性。 我们假定每个用户的奖赏都遵循一个最初未知的随机过程。 我们设计组合式多臂宽频谱算法, 以独立或连续的预算解决这个问题。 我们证明提议的算法在半波段反馈下实现了对数性遗憾。