We consider contextual bandits with knapsacks, with an underlying structure between the rewards generated and the cost vectors suffered. Our motivation is sales with commercial discounts. At each round, given the stochastic i.i.d.\ context $\mathbf{x}_t$ and the arm picked $a_t$ (corresponding, e.g., to a discount level), a customer conversion may be obtained, in which case a reward $r(a_t,\mathbf{x}_t)$ is gained and vector costs $c(a_t,\mathbf{x}_t)$ are suffered (corresponding, e.g., to losses of earnings). Otherwise, in the absence of a conversion, the reward and costs are null. The reward and costs achieved are thus coupled through the binary variable indicating whether a conversion occurred. This underlying structure between rewards and costs differs from the linear structures considered by Agrawal and Devanur [2016] (but we show that the techniques introduced in the present article may also be applied to these linear structures). The adaptive policies exhibited solve, at each round, a linear program based on upper-confidence estimates of the probabilities of conversion given $a$ and $\mathbf{x}$. This kind of policy is most natural and achieves a regret bound of the typical order $(\mathrm{OPT}/B)\sqrt{T}$, where $B$ is the total budget allowed, $\mathrm{OPT}$ is the optimal expected reward achievable by a static policy, and $T$ is the number of rounds.
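To make the coupling and the per-round linear program concrete, the following is a minimal sketch under strong simplifying assumptions that are not taken from the article: the context is dropped (a single fixed $\mathbf{x}$), the cost is a scalar rather than a vector, the confidence bonus is a generic Hoeffding-style term, the same optimistic estimate is used for rewards and costs, and leftover probability mass is spent on a hypothetical "no offer" action with zero reward and cost. All function names (`ucb`, `solve_round_lp`, `play_round`) are illustrative.

```python
import math
import random


def ucb(conversions, pulls, t):
    """Optimistic estimate of a conversion probability, clipped to [0, 1].

    Assumption: a generic Hoeffding-style bonus, not the article's estimate.
    """
    if pulls == 0:
        return 1.0
    bonus = math.sqrt(2.0 * math.log(max(t, 2)) / pulls)
    return min(1.0, conversions / pulls + bonus)


def solve_round_lp(p_hat, rewards, costs, budget_per_round):
    """Maximise sum_a q[a] * p_hat[a] * rewards[a] subject to
    sum_a q[a] * p_hat[a] * costs[a] <= budget_per_round,
    sum_a q[a] <= 1 and q >= 0 (scalar cost, non-negative inputs).

    With only two constraints, some optimal vertex has at most two
    nonzero q[a], so enumerating single arms and pairs is enough.
    """
    n = len(p_hat)
    u = [p_hat[a] * rewards[a] for a in range(n)]  # expected reward per arm
    k = [p_hat[a] * costs[a] for a in range(n)]    # expected cost per arm
    best_val, best_q = 0.0, [0.0] * n

    # Vertices supported on a single arm.
    for a in range(n):
        q_a = 1.0 if k[a] <= budget_per_round else budget_per_round / k[a]
        if q_a * u[a] > best_val:
            best_val = q_a * u[a]
            best_q = [0.0] * n
            best_q[a] = q_a

    # Vertices mixing two arms with both constraints tight.
    for a in range(n):
        for b in range(a + 1, n):
            if k[a] == k[b]:
                continue
            q_a = (budget_per_round - k[b]) / (k[a] - k[b])
            if 0.0 <= q_a <= 1.0:
                val = q_a * u[a] + (1.0 - q_a) * u[b]
                if val > best_val:
                    best_val = val
                    best_q = [0.0] * n
                    best_q[a], best_q[b] = q_a, 1.0 - q_a
    return best_q, best_val


def play_round(q, true_p, rewards, costs, rng):
    """Sample an arm from q, then a conversion; reward and cost are both
    null unless the customer converts (the coupling described above)."""
    x, acc = rng.random(), 0.0
    for a, q_a in enumerate(q):
        acc += q_a
        if x < acc:
            converted = rng.random() < true_p[a]
            reward = rewards[a] if converted else 0.0
            cost = costs[a] if converted else 0.0
            return a, converted, reward, cost
    return None, False, 0.0, 0.0  # hypothetical "no offer" action
```

For instance, with two arms whose optimistic conversion estimates are 0.2 and 0.8, rewards of 10 each, costs 0 and 5, and a per-round budget of 1, `solve_round_lp` mixes the two arms rather than playing either alone. Handling vector costs or several contexts would require a general LP solver instead of the pair enumeration used here.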