We consider the following variant of contextual linear bandits, motivated by routing applications in navigational engines and recommendation systems. We wish to learn a hidden $d$-dimensional vector $w^*$. Every round, we are presented with a subset $\mathcal{X}_t \subseteq \mathbb{R}^d$ of possible actions. If we choose (i.e., recommend to the user) action $x_t$, we obtain utility $\langle x_t, w^* \rangle$ but only learn the identity of the best action $\arg\max_{x \in \mathcal{X}_t} \langle x, w^* \rangle$. We design algorithms for this problem which achieve regret $O(d\log T)$ and $\exp(O(d \log d))$. To accomplish this, we design novel cutting-plane algorithms with low "regret", defined as the total distance between the true point $w^*$ and the hyperplanes the separation oracle returns. We also consider the variant where we are allowed to provide a list of several recommendations. In this variant, we give an algorithm with $O(d^2 \log d)$ regret and list size $\mathrm{poly}(d)$. Finally, we construct nearly tight algorithms for a weaker variant of this problem where the learner only learns the identity of an action that is better than the recommendation. Our results rely on new algorithmic techniques in convex geometry (including a variant of Steiner's formula for the centroid of a convex set) which may be of independent interest.
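To make the feedback model concrete, the following is a minimal simulation sketch of the interaction described above. It is not the cutting-plane algorithm of this work: the learner is a simple perceptron-style placeholder, and the names `simulate`, `best_action`, `w_hat`, the uniform sampling of actions, and the per-round regret $\langle x^*_t, w^* \rangle - \langle x_t, w^* \rangle$ (with $x^*_t$ the best action in round $t$) are all assumptions made for illustration.

```python
import random


def dot(u, v):
    """Inner product <u, v> of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def best_action(actions, w):
    """Return the action in `actions` maximizing <x, w>."""
    return max(actions, key=lambda x: dot(x, w))


def simulate(d=3, rounds=200, n_actions=5, seed=0):
    """Run the recommendation protocol and return the total regret.

    Assumption: actions and w* are drawn uniformly from [-1, 1]^d;
    the abstract does not specify how the action sets are generated.
    """
    rng = random.Random(seed)
    w_star = [rng.uniform(-1, 1) for _ in range(d)]  # hidden vector w*
    w_hat = [0.0] * d                                # learner's current guess
    total_regret = 0.0
    for _ in range(rounds):
        # The action set X_t presented in this round.
        actions = [[rng.uniform(-1, 1) for _ in range(d)]
                   for _ in range(n_actions)]
        x_t = best_action(actions, w_hat)       # our recommendation
        x_best = best_action(actions, w_star)   # feedback: identity of the best action
        # Assumed regret: utility gap between the best action and ours.
        total_regret += dot(x_best, w_star) - dot(x_t, w_star)
        if x_t != x_best:
            # The feedback implies <x_best - x_t, w*> >= 0, a halfspace
            # containing w*. A placeholder perceptron-style step moves
            # the guess toward that halfspace.
            w_hat = [w + (b - a) for w, a, b in zip(w_hat, x_t, x_best)]
    return total_regret


if __name__ == "__main__":
    print("total regret:", simulate())
```

Each round in which the recommendation is wrong reveals a halfspace containing $w^*$, which is what allows the problem to be treated with cutting-plane methods; the placeholder update above uses that halfspace only heuristically and carries none of the regret guarantees stated in the abstract.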