Low-complexity models such as linear function representation play a pivotal role in enabling sample-efficient reinforcement learning (RL). The current paper pertains to a scenario with value-based linear representation, which postulates the linear realizability of the optimal Q-function (also called the "linear $Q^{\star}$ problem"). While linear realizability alone does not allow for sample-efficient solutions in general, the presence of a large sub-optimality gap is a potential game changer, depending on the sampling mechanism in use. Informally, sample efficiency is achievable with a large sub-optimality gap when a generative model is available but is unfortunately infeasible when we turn to standard online RL settings. In this paper, we make progress towards understanding this linear $Q^{\star}$ problem by investigating a new sampling protocol, which draws samples in an online/exploratory fashion but allows one to backtrack and revisit previous states in a controlled and infrequent manner. This protocol is more flexible than the standard online RL setting, while being practically relevant and far more restrictive than the generative model. We develop an algorithm tailored to this setting, achieving a sample complexity that scales polynomially with the feature dimension, the horizon, and the inverse sub-optimality gap, but not the size of the state/action space. Our findings underscore the fundamental interplay between sampling protocols and low-complexity structural representation in RL.
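To make the two structural assumptions above concrete, one common formalization is the following; the symbols $\phi$, $\theta_h^{\star}$, $d$, $H$, and $\Delta_{\mathrm{gap}}$ are standard in this literature and are introduced here for illustration rather than taken from the abstract itself. Linear $Q^{\star}$ realizability posits that, for each step $h$, there is a vector $\theta_h^{\star} \in \mathbb{R}^{d}$ with
$$ Q_h^{\star}(s,a) \;=\; \big\langle \phi(s,a),\, \theta_h^{\star} \big\rangle \qquad \text{for all } (s,a) \in \mathcal{S} \times \mathcal{A}, $$
and the sub-optimality gap condition requires
$$ \Delta_{\mathrm{gap}} \;:=\; \min_{h,\, s,\; a:\, Q_h^{\star}(s,a) < V_h^{\star}(s)} \Big\{ V_h^{\star}(s) - Q_h^{\star}(s,a) \Big\} \;>\; 0. $$
Under these two conditions, a sample complexity that is polynomial in $d$, $H$, and $1/\Delta_{\mathrm{gap}}$, as claimed above, carries no dependence on $|\mathcal{S}|$ or $|\mathcal{A}|$.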