We study two model selection settings in stochastic linear bandits (LB). In the first setting, the reward parameter of the LB problem is arbitrarily selected from $M$ models, represented as (possibly) overlapping balls in $\mathbb R^d$. However, the agent has access only to misspecified models, i.e., estimates of the centers and radii of the balls. We refer to this setting as parameter selection. In the second setting, which we refer to as feature selection, the expected reward of the LB problem lies in the linear span of at least one of $M$ feature maps (models). For each setting, we develop and analyze an algorithm based on a reduction from bandits to full-information problems. This allows us to obtain regret bounds that are no worse (up to a $\sqrt{\log M}$ factor) than in the case where the true model is known. Our parameter selection algorithm is OFUL-style, and our feature selection algorithm is based on the SquareCB algorithm. We also show that the regret of our parameter selection algorithm scales logarithmically with model misspecification.
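For concreteness, one way to formalize the two settings, under the standard linear-bandit noise model; the symbols $\theta^*$, $c_i$, $r_i$, $\hat c_i$, $\hat r_i$, $\phi_i$, $x_t$, and $\eta_t$ are notation introduced here for illustration rather than taken verbatim from the paper. In parameter selection, the unknown reward parameter satisfies
\[
\theta^* \in \bigcup_{i=1}^{M} \mathcal{B}(c_i, r_i) \subset \mathbb{R}^d,
\qquad
r_t = \langle x_t, \theta^* \rangle + \eta_t,
\]
where $\mathcal{B}(c_i, r_i)$ is the ball of center $c_i$ and radius $r_i$, but the agent only observes estimates $(\hat c_i, \hat r_i)$ of these centers and radii. In feature selection, there exists an unknown index $i^* \in [M]$ and a parameter $\theta^*$ such that
\[
\mathbb{E}\,[\,r_t \mid a_t = a\,] = \langle \phi_{i^*}(a), \theta^* \rangle \quad \text{for all actions } a,
\]
i.e., the expected reward lies in the linear span of the feature map $\phi_{i^*}$.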