In the problem of (binary) contextual bandits with knapsacks (CBwK), the agent receives an i.i.d. context in each of the $T$ rounds and chooses an action, resulting in a random reward and a random consumption of resources, both governed by an i.i.d. external factor. The agent's goal is to maximize the accumulated reward subject to the initial resource constraints. In this work, we combine the re-solving heuristic, which has proved successful in revenue management, with distribution estimation techniques to solve this problem. We consider two information feedback models, full and partial information, which differ in how hard it is to observe a sample of the external factor. Under both information feedback settings, we obtain two kinds of results: (1) For general problems, we show that our algorithm attains an $\widetilde O(T^{\alpha_u} + T^{\alpha_v} + T^{1/2})$ regret against the fluid benchmark, where $\alpha_u$ and $\alpha_v$ reflect the complexity of the context and external factor distributions, respectively. This result is comparable to existing results. (2) When the fluid problem is a linear program with a unique and non-degenerate optimal solution, our algorithm attains an $\widetilde O(1)$ regret. To the best of our knowledge, this is the first $\widetilde O(1)$ regret result for the CBwK problem under either information feedback model. We further verify our results with numerical experiments.
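The re-solving heuristic referenced above can be illustrated with a minimal sketch. Here the context distribution, rewards, and costs are hypothetical toy values assumed known for simplicity (the paper's algorithm estimates them from data); with a finite context set and a single resource, the fluid LP reduces to a fractional knapsack that we re-solve each round against the remaining budget and horizon:

```python
import random

def solve_fluid(contexts, probs, rewards, costs, budget_per_round):
    """Solve the fluid LP for a single resource and binary action:
    maximize E[q(x) r(x)]  s.t.  E[q(x) c(x)] <= budget_per_round, 0 <= q <= 1.
    With one constraint this is a fractional knapsack (costs assumed > 0)."""
    q = {x: 0.0 for x in contexts}
    remaining = budget_per_round
    # accept contexts in decreasing reward-per-cost order
    for x in sorted(contexts, key=lambda x: rewards[x] / costs[x], reverse=True):
        mass_cost = probs[x] * costs[x]  # expected cost of always accepting x
        if mass_cost <= remaining:
            q[x] = 1.0
            remaining -= mass_cost
        else:
            q[x] = remaining / mass_cost  # accept fractionally, budget exhausted
            break
    return q

def resolving_policy(T, B, contexts, probs, rewards, costs, rng):
    """Re-solving heuristic: each round, recompute the fluid LP with the
    remaining budget spread over the remaining horizon, then act on its q."""
    budget, total_reward = B, 0.0
    for t in range(T):
        q = solve_fluid(contexts, probs, rewards, costs, budget / (T - t))
        x = rng.choices(contexts, weights=[probs[c] for c in contexts])[0]
        if rng.random() < q[x] and budget >= costs[x]:
            total_reward += rewards[x]
            budget -= costs[x]
    return total_reward
```

The key design point is that the acceptance probabilities `q` are not fixed up front: re-solving against the remaining budget automatically tightens or loosens the policy as realized consumption drifts from its expectation, which is what drives the improved regret in the non-degenerate LP case.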