In view of its power in extracting feature representations, contrastive self-supervised learning has been successfully integrated into the practice of (deep) reinforcement learning (RL), leading to efficient policy learning in various applications. Despite its tremendous empirical successes, the understanding of contrastive learning for RL remains elusive. To narrow such a gap, we study how RL can be empowered by contrastive learning in a class of Markov decision processes (MDPs) and Markov games (MGs) with low-rank transitions. For both models, we propose to extract the correct feature representations of the low-rank model by minimizing a contrastive loss. Moreover, in the online setting, we propose novel upper confidence bound (UCB)-type algorithms that incorporate such a contrastive loss with online RL algorithms for MDPs or MGs. We further theoretically prove that our algorithm recovers the true representations and simultaneously achieves sample efficiency in learning the optimal policy in MDPs and the Nash equilibrium in MGs. We also provide empirical studies to demonstrate the efficacy of the UCB-based contrastive learning method for RL. To the best of our knowledge, we provide the first provably efficient online RL algorithm that incorporates contrastive learning for representation learning. Our code is available at https://github.com/Baichenjia/Contrastive-UCB.
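To make the two ingredients of the abstract concrete, below is a minimal PyTorch sketch (not the authors' implementation) of (i) learning low-rank features phi(s, a) and mu(s') with an InfoNCE-style contrastive loss so that the transition density is approximated by the inner product <phi(s, a), mu(s')>, and (ii) an elliptical UCB exploration bonus computed from the learned features. All names, dimensions, and hyperparameters (STATE_DIM, ACTION_DIM, FEATURE_DIM, BONUS_BETA, RIDGE_LAMBDA) are illustrative assumptions, not values from the paper or repository.

```python
# Minimal sketch: contrastive representation learning for a low-rank MDP
# plus a UCB-style elliptical bonus built from the learned features.
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, ACTION_DIM, FEATURE_DIM = 8, 4, 16   # illustrative dimensions
BONUS_BETA = 1.0      # scaling of the UCB exploration bonus (assumed)
RIDGE_LAMBDA = 1.0    # ridge regularizer for the feature covariance (assumed)


class LowRankEncoder(nn.Module):
    """Embeds (s, a) into phi(s, a) and s' into mu(s'), so the transition
    density is modeled as the inner product <phi(s, a), mu(s')>."""

    def __init__(self):
        super().__init__()
        self.phi = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
            nn.Linear(64, FEATURE_DIM),
        )
        self.mu = nn.Sequential(
            nn.Linear(STATE_DIM, 64), nn.ReLU(),
            nn.Linear(64, FEATURE_DIM),
        )

    def forward(self, state, action, next_state):
        return self.phi(torch.cat([state, action], dim=-1)), self.mu(next_state)


def contrastive_loss(encoder, state, action, next_state):
    """InfoNCE-style loss: the true next state of each (s, a) pair must score
    higher than the next states of the other pairs in the batch (negatives)."""
    phi_sa, mu_next = encoder(state, action, next_state)   # (B, d), (B, d)
    logits = phi_sa @ mu_next.t()                          # (B, B) similarity scores
    labels = torch.arange(state.shape[0])                  # diagonal entries are positives
    return F.cross_entropy(logits, labels)


def ucb_bonus(encoder, replay_state, replay_action, query_state, query_action):
    """Elliptical bonus beta * ||phi(s, a)||_{Lambda^{-1}}, where Lambda is the
    ridge-regularized covariance of the features of the data collected so far."""
    with torch.no_grad():
        phi_data = encoder.phi(torch.cat([replay_state, replay_action], dim=-1))
        cov = phi_data.t() @ phi_data + RIDGE_LAMBDA * torch.eye(FEATURE_DIM)
        phi_q = encoder.phi(torch.cat([query_state, query_action], dim=-1))
        quad = (phi_q @ torch.linalg.inv(cov) * phi_q).sum(dim=-1)
        return BONUS_BETA * quad.clamp_min(0.0).sqrt()


if __name__ == "__main__":
    enc = LowRankEncoder()
    opt = torch.optim.Adam(enc.parameters(), lr=1e-3)
    # Toy batch standing in for transitions collected by the online RL algorithm.
    s = torch.randn(32, STATE_DIM)
    a = torch.randn(32, ACTION_DIM)
    s_next = torch.randn(32, STATE_DIM)
    loss = contrastive_loss(enc, s, a, s_next)
    loss.backward()
    opt.step()
    bonus = ucb_bonus(enc, s, a, s[:4], a[:4])
    print(f"contrastive loss {loss.item():.3f}, bonus shape {tuple(bonus.shape)}")
```

In an online UCB-type algorithm, a bonus of this form would be added to the estimated value of each state-action pair so that the planner favors directions of the feature space that the collected data covers poorly; here it is shown only as a standalone computation on the learned phi features.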