Motivated by the prevailing paradigm of using unsupervised learning for efficient exploration in reinforcement learning (RL) problems [tang2017exploration,bellemare2016unifying], we investigate when this paradigm is provably efficient. We study episodic Markov decision processes with rich observations generated from a small number of latent states. We present a general algorithmic framework that is built upon two components: an unsupervised learning algorithm and a no-regret tabular RL algorithm. Theoretically, we prove that as long as the unsupervised learning algorithm enjoys a polynomial sample complexity guarantee, we can find a near-optimal policy with sample complexity polynomial in the number of latent states, which is significantly smaller than the number of observations. Empirically, we instantiate our framework on a class of hard exploration problems to demonstrate the practicality of our theory.
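To make the two-component structure concrete, the following is a minimal, self-contained sketch (Python, using NumPy and scikit-learn) of how such a framework could be wired together on a toy rich-observation problem. The toy environment, the choice of k-means as the unsupervised learning component, and the plain epsilon-greedy Q-learning loop standing in for the tabular RL component are all illustrative assumptions, not the paper's actual algorithm or its no-regret requirement.

```python
# Illustrative sketch only (not the paper's algorithm): observations are noisy
# emissions of a few latent states. Phase 1 learns a decoder from observations
# to latent-state indices with k-means; Phase 2 runs tabular Q-learning on the
# decoded states instead of the raw high-dimensional observations.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# --- Toy episodic environment: 3 latent states, 2 actions, 8-dim observations ---
N_LATENT, N_ACTIONS, OBS_DIM, HORIZON = 3, 2, 8, 5
P = rng.dirichlet(np.ones(N_LATENT), size=(N_LATENT, N_ACTIONS))  # transition kernel
R = rng.random((N_LATENT, N_ACTIONS))                              # reward table
EMIT = rng.normal(size=(N_LATENT, OBS_DIM)) * 3.0                  # emission means

def emit(s):
    """Rich observation: latent-state emission mean plus Gaussian noise."""
    return EMIT[s] + rng.normal(scale=0.5, size=OBS_DIM)

def step(s, a):
    s_next = rng.choice(N_LATENT, p=P[s, a])
    return s_next, R[s, a]

# --- Phase 1: unsupervised learning of a decoder from warm-up observations ---
warmup_obs = []
for _ in range(200):                      # random-policy roll-outs
    s = rng.integers(N_LATENT)
    for _ in range(HORIZON):
        warmup_obs.append(emit(s))
        s, _ = step(s, rng.integers(N_ACTIONS))
decoder = KMeans(n_clusters=N_LATENT, n_init=10, random_state=0).fit(np.array(warmup_obs))

def decode(obs):
    return int(decoder.predict(obs.reshape(1, -1))[0])

# --- Phase 2: tabular RL on the decoded latent states ---
Q = np.zeros((N_LATENT, N_ACTIONS))
alpha, eps = 0.1, 0.2
for episode in range(500):
    s = rng.integers(N_LATENT)
    z = decode(emit(s))
    for _ in range(HORIZON):
        a = rng.integers(N_ACTIONS) if rng.random() < eps else int(Q[z].argmax())
        s, r = step(s, a)
        z_next = decode(emit(s))
        Q[z, a] += alpha * (r + Q[z_next].max() - Q[z, a])
        z = z_next

print("Greedy policy on decoded states:", Q.argmax(axis=1))
```

The point of the sketch is the interface: the tabular RL component only ever sees latent-state indices of size N_LATENT, so its sample complexity scales with the number of latent states rather than with the (much larger) observation space, mirroring the separation the abstract describes.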