Deep latent variable models have achieved significant empirical success in model-based reinforcement learning (RL) due to their expressiveness in modeling complex transition dynamics. On the other hand, it remains unclear, both theoretically and empirically, how latent variable models may facilitate learning, planning, and exploration to improve the sample efficiency of RL. In this paper, we provide a representation view of latent variable models for state-action value functions, which allows both a tractable variational learning algorithm and an effective implementation of the optimism/pessimism principle in the face of uncertainty for exploration. In particular, we propose a computationally efficient planning algorithm with UCB exploration by incorporating kernel embeddings of latent variable models. Theoretically, we establish the sample complexity of the proposed approach in both the online and offline settings. Empirically, we demonstrate superior performance over current state-of-the-art algorithms across various benchmarks.