Reinforcement Learning in Low-Rank MDPs with Density Features

MDPs with low-rank transitions (that is, MDPs whose transition matrix can be factored into the product of a left matrix and a right matrix) are a highly representative structure that enables tractable learning. The left matrix enables expressive function approximation for value-based learning and has been studied extensively. In this work, we instead investigate sample-efficient learning with density features, i.e., the right matrix, which induce powerful models for state-occupancy distributions. This setting not only sheds light on leveraging unsupervised learning in RL, but also enables plug-in solutions for convex RL. In the offline setting, we propose an algorithm for off-policy estimation of occupancies that can handle non-exploratory data. Using this as a subroutine, we further devise an online algorithm that constructs exploratory data distributions in a level-by-level manner. As a central technical challenge, the additive error of occupancy estimation is incompatible with the multiplicative definition of data coverage. In the absence of strong assumptions like reachability, this incompatibility easily leads to exponential error blow-up, which we overcome via novel technical tools. Our results also readily extend to the representation learning setting, where the density features are unknown and must be learned from an exponentially large candidate set.
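
To make the role of the right matrix concrete, here is a minimal sketch on a toy tabular MDP. It is not the paper's algorithm: every number, size, and function name is hypothetical, and the left features are assumed to be nonnegative mixture weights purely to keep the example small. It builds P(s' | s, a) as the product of left features phi and density features mu, then checks that the next-step state occupancy is a linear combination of the rows of mu.

```python
# Toy illustration only: sizes, numbers, and names are assumptions, not from the paper.
from fractions import Fraction as F

# Right matrix (density features): mu[i] is a distribution over 3 next states.
mu = [
    [F(1, 2), F(1, 2), F(0)],
    [F(0), F(1, 4), F(3, 4)],
]

# Left matrix: phi[s][a] holds weights over the d = 2 latent factors.
# (Assumed here to be mixture weights so that every row of P is a distribution.)
phi = [
    [[F(1), F(0)], [F(1, 2), F(1, 2)]],
    [[F(0), F(1)], [F(1, 4), F(3, 4)]],
    [[F(1, 3), F(2, 3)], [F(1), F(0)]],
]


def transition(s, a):
    """Low-rank transition row: P(. | s, a) = sum_i phi[s][a][i] * mu[i][.]."""
    return [sum(phi[s][a][i] * mu[i][t] for i in range(len(mu)))
            for t in range(len(mu[0]))]


def next_occupancy_direct(d_prev, policy):
    """Push the occupancy d_prev one step forward through P under the policy."""
    out = [F(0)] * len(mu[0])
    for s, ds in enumerate(d_prev):
        for a, pa in enumerate(policy[s]):
            row = transition(s, a)
            for t in range(len(out)):
                out[t] += ds * pa * row[t]
    return out


def next_occupancy_via_features(d_prev, policy):
    """Same occupancy, written as theta^T mu with theta = E[phi(s, a)]."""
    theta = [sum(d_prev[s] * policy[s][a] * phi[s][a][i]
                 for s in range(len(phi)) for a in range(len(phi[s])))
             for i in range(len(mu))]
    occ = [sum(theta[i] * mu[i][t] for i in range(len(mu)))
           for t in range(len(mu[0]))]
    return theta, occ


d0 = [F(1, 2), F(1, 4), F(1, 4)]                       # occupancy at the previous level
policy = [[F(1, 2), F(1, 2)], [F(1), F(0)], [F(0), F(1)]]  # policy[s][a]

direct = next_occupancy_direct(d0, policy)
theta, via_features = next_occupancy_via_features(d0, policy)
print(direct == via_features, theta)
```

In this sketch the occupancy over three states is determined by only two coefficients (theta), which is the sense in which density features give a compact model of state-occupancy distributions. Exact fractions are used so the two computations can be compared without floating-point tolerance.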
