In contextual continuum-armed bandits, the contexts $x$ and the arms $y$ are both continuous and drawn from high-dimensional spaces. The payoff function $f(x,y)$ to be learned has no particular parametric form. The literature has shown that for Lipschitz-continuous functions the optimal regret is $\tilde{O}(T^{\frac{d_x+d_y+1}{d_x+d_y+2}})$, where $d_x$ and $d_y$ are the dimensions of the contexts and arms, respectively, so it suffers from the curse of dimensionality. We develop an algorithm that achieves regret $\tilde{O}(T^{\frac{d_x+1}{d_x+2}})$ when $f$ is globally concave in $y$. Global concavity is a common assumption in many applications. The algorithm is based on stochastic approximation and estimates gradient information in an online fashion. Our results yield the insight that the curse of dimensionality in the arms can be overcome under mild structural assumptions on the payoff function.
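To make the two ideas in the abstract concrete, the sketch below first evaluates the two regret exponents for given dimensions, and then runs a classical Kiefer-Wolfowitz stochastic approximation on a noisy concave function for a single fixed context. It is an illustration under stated assumptions and not the algorithm of the paper: the toy quadratic payoff, the step-size and perturbation schedules, and all function names (`regret_exponents`, `kiefer_wolfowitz_ascent`, `make_toy_payoff`) are hypothetical choices made here.

```python
import random


def regret_exponents(d_x, d_y):
    """Exponents of T in the two regret rates quoted in the abstract."""
    lipschitz = (d_x + d_y + 1) / (d_x + d_y + 2)  # Lipschitz payoff
    concave = (d_x + 1) / (d_x + 2)                # payoff concave in y
    return lipschitz, concave


def kiefer_wolfowitz_ascent(noisy_payoff, y0, steps):
    """Maximize a concave function using noisy evaluations only.

    Each iteration estimates the gradient by central finite differences
    along every coordinate, then takes a diminishing ascent step.
    """
    y = list(y0)
    for t in range(1, steps + 1):
        step_size = 0.5 / t        # a_t: assumed schedule, decays like 1/t
        width = 1.0 / t ** 0.25    # c_t: assumed schedule, shrinks to zero
        grad = []
        for i in range(len(y)):
            up = y[:i] + [y[i] + width] + y[i + 1:]
            down = y[:i] + [y[i] - width] + y[i + 1:]
            grad.append((noisy_payoff(up) - noisy_payoff(down)) / (2 * width))
        y = [y_i + step_size * g_i for y_i, g_i in zip(y, grad)]
    return y


def make_toy_payoff(target, noise_std, rng):
    """Hypothetical payoff for one fixed context: concave quadratic plus noise."""
    def payoff(y):
        value = -sum((y_i - t_i) ** 2 for y_i, t_i in zip(y, target))
        return value + rng.gauss(0.0, noise_std)
    return payoff


if __name__ == "__main__":
    print(regret_exponents(d_x=2, d_y=10))
    rng = random.Random(0)
    payoff = make_toy_payoff([0.3, -0.7], 0.1, rng)
    print(kiefer_wolfowitz_ascent(payoff, [0.0, 0.0], steps=2000))
```

For example, with $d_x = 2$ and $d_y = 10$ the Lipschitz exponent is $13/14$ while the concave-in-$y$ exponent is $3/4$, which no longer depends on $d_y$. The ascent routine only shows how gradient information can be estimated online from noisy payoffs; handling a continuum of contexts is outside the scope of this sketch.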