In sparse linear bandits, a learning agent sequentially selects an action and receives reward feedback, and the reward function depends linearly on a few coordinates of the covariates of the actions. This setting has applications in many real-world sequential decision-making problems. In this paper, we propose a simple and computationally efficient sparse linear estimation method called PopArt that enjoys a tighter $\ell_1$ recovery guarantee compared to Lasso (Tibshirani, 1996) in many problems. Our bound naturally motivates an experimental design criterion that is convex and thus computationally efficient to solve. Based on our novel estimator and design criterion, we derive sparse linear bandit algorithms that enjoy improved regret upper bounds over the state of the art (Hao et al., 2020), especially with respect to the geometry of the given action set. Finally, we prove a matching lower bound for sparse linear bandits in the data-poor regime, which closes the gap between upper and lower bounds in prior work.