We study the stochastic contextual bandit with knapsacks (CBwK) problem, in which each action, taken in response to a context, yields a random reward and also incurs a random, vector-valued resource consumption. The challenge is to maximize the total reward without violating the budget of any resource. We study this problem under a general realizability setting, where the expected reward and the expected cost are functions of the context and action that belong to given general function classes $\mathcal{F}$ and $\mathcal{G}$, respectively. Existing work on CBwK is restricted to linear function classes because it relies on UCB-type algorithms, which depend heavily on the linear form and are therefore difficult to extend to general function classes. Motivated by online regression oracles, which have been successfully applied to contextual bandits, we propose the first universal and optimal algorithmic framework for CBwK by reducing it to online regression. We also establish a regret lower bound that shows the optimality of our algorithm for a variety of function classes.
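To make the interaction protocol described above concrete, the following is a minimal sketch of one CBwK episode on a toy environment. Everything in it is an assumption made for illustration: the function names (`toy_mean_reward`, `toy_mean_costs`, `placeholder_policy`, `run_cbwk`), the two-dimensional context, the Bernoulli rewards and costs, and the conservative stopping rule. The decision rule is a stand-in and is not the algorithm proposed in the paper.

```python
import random
from typing import Callable, List, Sequence, Tuple

Context = List[float]
Policy = Callable[[Context, int], int]


def toy_mean_reward(context: Context, action: int, num_actions: int) -> float:
    """Hypothetical expected reward f(x, a) in [0, 1]; not taken from the paper."""
    return (context[0] + action / num_actions) / 2.0


def toy_mean_costs(context: Context, action: int, num_actions: int,
                   num_resources: int) -> List[float]:
    """Hypothetical expected cost vector g(x, a), one entry in [0, 1] per resource."""
    return [(context[j % 2] + (action + 1) / num_actions) / 2.0
            for j in range(num_resources)]


def placeholder_policy(context: Context, num_actions: int) -> int:
    """Stand-in decision rule (NOT the paper's algorithm): bucket the first feature."""
    return min(int(context[0] * num_actions), num_actions - 1)


def run_cbwk(policy: Policy, horizon: int, num_actions: int,
             budgets: Sequence[float], seed: int = 0) -> Tuple[float, List[float], int]:
    """Simulate one episode; returns (total reward, remaining budgets, rounds played)."""
    rng = random.Random(seed)
    remaining = list(budgets)
    total_reward = 0.0
    rounds_played = 0
    for _ in range(horizon):
        # Assumed stopping rule: per-round costs are at most 1, so halting once
        # any resource has less than 1 unit left guarantees no budget violation.
        if min(remaining) < 1.0:
            break
        context = [rng.random(), rng.random()]  # i.i.d. context with two features
        action = policy(context, num_actions)
        mean_reward = toy_mean_reward(context, action, num_actions)
        mean_costs = toy_mean_costs(context, action, num_actions, len(remaining))
        # Bernoulli reward and Bernoulli cost per resource (an assumed noise model).
        reward = 1.0 if rng.random() < mean_reward else 0.0
        costs = [1.0 if rng.random() < m else 0.0 for m in mean_costs]
        total_reward += reward
        remaining = [b - c for b, c in zip(remaining, costs)]
        rounds_played += 1
    return total_reward, remaining, rounds_played


if __name__ == "__main__":
    total, left, rounds = run_cbwk(placeholder_policy, horizon=1000,
                                   num_actions=3, budgets=[20.0, 30.0], seed=1)
    print(f"reward={total}, remaining budgets={left}, rounds played={rounds}")
```

In this sketch the learner never sees `toy_mean_reward` or `toy_mean_costs` directly; under the realizability setting of the abstract, these play the role of the unknown members of $\mathcal{F}$ and $\mathcal{G}$ that a regression-based method would have to estimate from observed rewards and costs.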