It is common to address the curse of dimensionality in Markov decision processes (MDPs) by exploiting low-rank representations. This motivates much of the recent theoretical study on linear MDPs. However, most approaches either require a given representation under unrealistic assumptions about the normalization of the decomposition, or introduce computational challenges that remain unresolved in practice. Instead, we consider an alternative definition of linear MDPs that automatically ensures normalization while allowing efficient representation learning via contrastive estimation. The framework also admits confidence-adjusted index algorithms, enabling an efficient and principled approach to incorporating optimism or pessimism in the face of uncertainty. To the best of our knowledge, this provides the first practical representation learning method for linear MDPs that achieves both strong theoretical guarantees and strong empirical performance. Theoretically, we prove that the proposed algorithm is sample efficient in both the online and offline settings. Empirically, we demonstrate performance superior to existing state-of-the-art model-based and model-free algorithms on several benchmarks.
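To make the idea of "representation learning via contrastive estimation" concrete, the sketch below shows a generic InfoNCE-style objective for fitting a factored transition model P(s' | s, a) ≈ φ(s, a)ᵀμ(s'), where other next states in the batch serve as negatives. This is only an illustrative assumption about how such an objective can look, not the paper's exact loss; all shapes and names here are hypothetical.

```python
# Illustrative sketch (not the paper's exact objective): an InfoNCE-style
# contrastive loss for a factored transition model phi(s,a)^T mu(s').
import numpy as np

def contrastive_transition_loss(phi_sa, mu_next):
    """phi_sa:  (B, d) features of observed (s, a) pairs.
    mu_next: (B, d) features of the corresponding observed next states s'.
    Row i's true next state is the positive; the other rows act as negatives."""
    logits = phi_sa @ mu_next.T                    # (B, B) pairwise scores phi(s,a)^T mu(s')
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))            # cross-entropy on the diagonal positives

# Toy usage with random features of dimension d = 8 and batch size B = 32.
rng = np.random.default_rng(0)
phi_sa, mu_next = rng.normal(size=(32, 8)), rng.normal(size=(32, 8))
print(contrastive_transition_loss(phi_sa, mu_next))
```

The softmax over in-batch candidates is what makes the estimator contrastive: the score of the observed transition is pushed up relative to the negatives without ever computing an explicit normalizing constant over the state space.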
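The "confidence-adjusted index" can be illustrated with a standard elliptical bonus on the learned features, in the spirit of LSVI-UCB-style algorithms; this is a generic sketch under that assumption, and the paper's actual bonus and constants may differ. Adding the bonus gives an optimistic index for online exploration, subtracting it gives a pessimistic index for offline learning.

```python
# Sketch of a confidence-adjusted index with learned features (an assumption,
# not the paper's exact construction).
import numpy as np

def elliptical_bonus(phi_query, phi_data, beta=1.0, lam=1.0):
    """phi_query: (d,) feature of the (s, a) pair being scored.
    phi_data:  (N, d) features of previously observed (s, a) pairs."""
    d = phi_data.shape[1]
    Lambda = phi_data.T @ phi_data + lam * np.eye(d)          # regularized feature covariance
    return beta * np.sqrt(phi_query @ np.linalg.solve(Lambda, phi_query))

def adjusted_index(q_estimate, bonus, optimistic=True):
    # Optimism (online): inflate the value estimate; pessimism (offline): deflate it.
    return q_estimate + bonus if optimistic else q_estimate - bonus

# Toy usage: score a query (s, a) against 100 previously seen feature vectors.
rng = np.random.default_rng(0)
phi_data, phi_query = rng.normal(size=(100, 8)), rng.normal(size=8)
b = elliptical_bonus(phi_query, phi_data, beta=0.5)
print(adjusted_index(1.0, b, optimistic=True), adjusted_index(1.0, b, optimistic=False))
```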