Recently, the study of misspecified linear bandits has generated intriguing implications for the hardness of learning in bandits and reinforcement learning (RL). In particular, Du et al. (2020) show that even if a learner is given linear features in $\mathbb{R}^d$ that approximate the rewards in a bandit or RL with a uniform error of $\varepsilon$, searching for an $O(\varepsilon)$-optimal action requires at least $\Omega(\exp(d))$ queries. Furthermore, Lattimore et al. (2020) show that a degraded $O(\varepsilon\sqrt{d})$-optimal solution can be learned within $\operatorname{poly}(d/\varepsilon)$ queries. Yet it is unknown whether a structural assumption on the ground-truth parameter, such as sparsity, could break the $\varepsilon\sqrt{d}$ barrier. In this paper, we address this question by showing that algorithms can obtain $O(\varepsilon)$-optimal actions by querying $O(\varepsilon^{-s}d^s)$ actions, where $s$ is the sparsity parameter, removing the $\exp(d)$-dependence. We then establish an information-theoretic lower bound of $\Omega(\exp(s))$, showing that our upper bound on the sample complexity is nearly tight if one demands an error of $O(s^{\delta}\varepsilon)$ for $0<\delta<1$. For $\delta\geq 1$, we further show that $\operatorname{poly}(s/\varepsilon)$ queries suffice when the linear features are "good", and even in general settings. These results provide a nearly complete picture of how sparsity can help in misspecified bandit learning, and offer a deeper understanding of when linear features are "useful" for bandit and reinforcement learning with misspecification.
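To make the scale of this improvement concrete, consider a back-of-the-envelope comparison with illustrative values $d = 100$, $s = 2$, and $\varepsilon = 0.1$ (these numbers are our own assumption for illustration, not taken from the paper):
\[
\exp(d) = e^{100} \approx 2.7 \times 10^{43},
\qquad\text{whereas}\qquad
\varepsilon^{-s} d^{s} = (0.1)^{-2} \cdot 100^{2} = 10^{6},
\]
so under sparsity the query cost drops from exponential in the ambient dimension $d$ to polynomial in $d$, at the price of a dependence on the much smaller sparsity $s$ (which the $\Omega(\exp(s))$ lower bound shows is, in this regime, unavoidable).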