We present a data-efficient framework for solving sequential decision-making problems that combines reinforcement learning (RL) with latent variable generative models. The framework, called GenRL, trains deep policies by introducing an action latent variable so that the feed-forward policy search can be divided into two parts: (i) training a sub-policy that outputs a distribution over the action latent variable given a state of the system, and (ii) unsupervised training of a generative model that outputs a sequence of motor actions conditioned on the latent action variable. GenRL enables safe exploration and alleviates data inefficiency because it exploits prior knowledge about valid sequences of motor actions. Moreover, we provide a set of measures for evaluating generative models that allow us to predict the performance of RL policy training before actual training on a physical robot. We experimentally determine which characteristics of generative models most influence the performance of the final policy on two robotics tasks: shooting a hockey puck and throwing a basketball. Finally, we empirically demonstrate that, compared to two state-of-the-art RL methods, GenRL is the only method that can safely and efficiently solve these robotics tasks.
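The following is a minimal sketch of the two-part policy factorization the abstract describes: a sub-policy maps a state to a distribution over the action latent variable, and a separately (unsupervised) trained generative decoder maps a sampled latent to a full sequence of motor actions. Module names (SubPolicy, ActionDecoder), the Gaussian parameterization, and all dimensions are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn


class SubPolicy(nn.Module):
    """Maps a state to a distribution over the action latent variable z."""

    def __init__(self, state_dim: int, latent_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)
        self.log_std = nn.Linear(hidden, latent_dim)

    def forward(self, state: torch.Tensor) -> torch.distributions.Normal:
        h = self.net(state)
        return torch.distributions.Normal(self.mu(h), self.log_std(h).exp())


class ActionDecoder(nn.Module):
    """Generative model (e.g., a decoder pretrained on valid motor
    trajectories) mapping a latent z to a sequence of motor actions."""

    def __init__(self, latent_dim: int, horizon: int, action_dim: int):
        super().__init__()
        self.horizon, self.action_dim = horizon, action_dim
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, horizon * action_dim),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z).view(-1, self.horizon, self.action_dim)


# Acting: sample z from the sub-policy and decode it into a full action
# sequence. In this scheme only the sub-policy is updated by RL, while the
# pretrained decoder stays fixed, restricting exploration to valid motions.
state_dim, latent_dim, horizon, action_dim = 10, 3, 20, 7
policy = SubPolicy(state_dim, latent_dim)
decoder = ActionDecoder(latent_dim, horizon, action_dim)

state = torch.randn(1, state_dim)
z = policy(state).rsample()      # reparameterized sample of the latent
actions = decoder(z)             # shape: (1, horizon, action_dim)
```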