Although deep reinforcement learning (RL) has recently enjoyed many successes, its methods remain data-inefficient, which makes many problems prohibitively expensive to solve in terms of the data they require. We aim to remedy this by exploiting the rich supervisory signal in unlabeled data to learn state representations. This thesis introduces three representation-learning algorithms, each with access to a different subset of the data sources that traditional RL algorithms use: (i) GrICA is inspired by independent component analysis (ICA) and trains a deep neural network to output statistically independent features of the input. GrICA does so by minimizing the mutual information between each feature and the remaining features, and it requires only an unordered collection of environment states. (ii) Latent Representation Prediction (LARP) requires more context: in addition to a state, it needs the previous state and the action that connects them. This method learns state representations by predicting the representation of the environment's next state from the current state and action. The learned predictor is then used with a graph-search algorithm. (iii) RewPred learns a state representation by training a deep neural network to predict a smoothed version of the reward function. The representation is used to preprocess inputs to deep RL algorithms, while the reward predictor is used for reward shaping. This method needs only state-reward pairs from the environment to learn the representation. We find that each method has its strengths and weaknesses, and we conclude from our experiments that including unsupervised representation learning in RL problem-solving pipelines can speed up learning.
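To make the GrICA objective concrete, the following is a minimal sketch in PyTorch, assuming a MINE-style neural mutual-information estimator is used to penalize dependence between one output feature and the remaining features. The class and function names (`Encoder`, `MIEstimator`, `mi_lower_bound`, `grica_step`), network sizes, and the alternating-update scheme are illustrative assumptions, not the thesis's exact implementation.

```python
import math
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps environment states to feature vectors (hypothetical architecture)."""
    def __init__(self, state_dim, n_features):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, n_features),
        )

    def forward(self, x):
        return self.net(x)

class MIEstimator(nn.Module):
    """MINE-style statistics network scoring (feature, rest-of-features) pairs."""
    def __init__(self, n_features):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, one, rest):
        return self.net(torch.cat([one, rest], dim=1))

def mi_lower_bound(T, one, rest):
    """Donsker-Varadhan lower bound on I(one; rest).

    Shuffling `rest` across the batch approximates sampling from the
    product of the marginals.
    """
    joint = T(one, rest).mean()
    shuffled = rest[torch.randperm(rest.size(0))]
    marginal = torch.logsumexp(T(one, shuffled).squeeze(-1), dim=0) - math.log(one.size(0))
    return joint - marginal

def grica_step(encoder, estimator, enc_opt, est_opt, states, i):
    """One adversarial step for feature i: the estimator maximizes the MI
    bound, while the encoder minimizes it, pushing feature i toward
    statistical independence from the remaining features."""
    z = encoder(states)
    one = z[:, i : i + 1]
    rest = torch.cat([z[:, :i], z[:, i + 1 :]], dim=1)

    # Tighten the MI estimate (gradients flow only into the estimator).
    est_opt.zero_grad()
    (-mi_lower_bound(estimator, one.detach(), rest.detach())).backward()
    est_opt.step()

    # Reduce the estimated MI (gradients flow into the encoder).
    enc_opt.zero_grad()
    mi_lower_bound(estimator, one, rest).backward()
    enc_opt.step()
```

In this sketch, cycling `grica_step` over every feature index with batches of raw states is all that is needed, which mirrors the claim that GrICA consumes only an unordered collection of environment states.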
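The RewPred idea of reusing a reward predictor's hidden features as a state representation can be sketched similarly. In the sketch below, the name `RewardPredictor`, the layer sizes, the mean-squared-error objective, and the additive shaping term with coefficient `beta` are all assumptions for illustration; the thesis's particular smoothing of the reward function is likewise not reproduced here, and `smoothed_rewards` is taken as given.

```python
import torch
import torch.nn as nn

class RewardPredictor(nn.Module):
    """Predicts a smoothed reward from a state; its penultimate layer
    doubles as the learned state representation (illustrative design)."""
    def __init__(self, state_dim, repr_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, repr_dim), nn.ReLU(),
        )
        self.head = nn.Linear(repr_dim, 1)

    def forward(self, state):
        return self.head(self.encoder(state)).squeeze(-1)

    @torch.no_grad()
    def represent(self, state):
        # Representation used to preprocess inputs for a downstream RL agent.
        return self.encoder(state)

def train_rewpred(model, states, smoothed_rewards, epochs=10, lr=1e-3):
    """Fits the predictor on (state, smoothed-reward) pairs only."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(states), smoothed_rewards)
        loss.backward()
        opt.step()
    return model

def shaped_reward(model, env_reward, next_state, beta=0.1):
    """Reward shaping (one plausible usage): augment the environment
    reward with the predictor's output for the next state."""
    return env_reward + beta * model(next_state.unsqueeze(0)).item()
```

Under these assumptions, a downstream deep RL agent would consume `model.represent(state)` as its observation while being trained on `shaped_reward` instead of the raw environment reward.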