We consider contextual bandits with knapsacks, with an underlying structure between the rewards generated and the cost vectors suffered. We do so motivated by sales with commercial discounts. At each round, given the stochastic i.i.d.\ context $\mathbf{x}_t$ and the arm picked $a_t$ (corresponding, e.g., to a discount level), a customer conversion may be obtained, in which case a reward $r(a_t,\mathbf{x}_t)$ is gained and vector costs $c(a_t,\mathbf{x}_t)$ are suffered (corresponding, e.g., to losses of earnings). Otherwise, in the absence of a conversion, the reward and costs are null. The reward and costs achieved are thus coupled through the binary variable measuring conversion or the absence thereof. This underlying structure between rewards and costs is different from the linear structures considered by Agrawal and Devanur [2016], but we show that the techniques introduced in this article may also be applied to the latter case. Namely, the adaptive policies exhibited solve at each round a linear program based on upper-confidence estimates of the probabilities of conversion given $a$ and $\mathbf{x}$. This kind of policy is most natural and achieves a regret bound of the typical order $(\mathrm{OPT}/B)\sqrt{T}$, where $B$ is the total budget allowed, $\mathrm{OPT}$ is the optimal expected reward achievable by a static policy, and $T$ is the number of rounds.
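The coupling described above can be written compactly as follows. The symbols $y_t$ (the binary conversion indicator) and $P(a,\mathbf{x})$ (the probability of conversion given $a$ and $\mathbf{x}$) are notation assumed here for illustration only and may differ from the notation used in the article.
\[
y_t \mid (a_t, \mathbf{x}_t) \sim \mathrm{Bernoulli}\bigl(P(a_t,\mathbf{x}_t)\bigr),
\qquad
\text{reward at round } t = r(a_t,\mathbf{x}_t)\, y_t,
\qquad
\text{costs at round } t = c(a_t,\mathbf{x}_t)\, y_t .
\]
Under this sketch, the conditional expected reward and costs given $(a_t,\mathbf{x}_t)$ are $r(a_t,\mathbf{x}_t)\,P(a_t,\mathbf{x}_t)$ and $c(a_t,\mathbf{x}_t)\,P(a_t,\mathbf{x}_t)$, so both depend on the same unknown quantity $P$, which is the quantity the upper-confidence estimates target.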