Goal-conditioned Reinforcement Learning (RL) aims at learning optimal policies, given goals encoded in special command inputs. Here we study goal-conditioned neural nets (NNs) that learn to generate deep NN policies in the form of context-specific weight matrices, similar to Fast Weight Programmers and other methods from the 1990s. Using context commands of the form "generate a policy that achieves a desired expected return," our NN generators combine powerful exploration of parameter space with generalization across commands to iteratively find better and better policies. A form of weight-sharing HyperNetworks and policy embeddings scales our method to generate deep NNs. Experiments show how a single learned policy generator can produce policies that achieve any return seen during training. Finally, we evaluate our algorithm on a set of continuous control tasks, where it exhibits competitive performance. Our code is public.
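To make the command-to-weights idea concrete, the sketch below shows a minimal generator that maps a scalar return command to the parameters of a small policy MLP and immediately runs the generated policy on an observation. This is only an illustration of the general mechanism described above: the layer sizes, the command encoding, and the class name `PolicyGenerator` are assumptions, and the weight-sharing HyperNetwork and policy-embedding components of the full method are omitted.

```python
import torch
import torch.nn as nn


class PolicyGenerator(nn.Module):
    """Hypothetical sketch: maps a desired-return command to the weights
    of a two-layer policy network (not the paper's exact architecture)."""

    def __init__(self, obs_dim, act_dim, hidden=32, cmd_embed=64):
        super().__init__()
        self.obs_dim, self.act_dim, self.hidden = obs_dim, act_dim, hidden
        # Total number of parameters in the generated two-layer policy.
        self.n_params = (obs_dim + 1) * hidden + (hidden + 1) * act_dim
        # Generator: scalar return command -> flat policy parameter vector.
        self.net = nn.Sequential(
            nn.Linear(1, cmd_embed), nn.ReLU(),
            nn.Linear(cmd_embed, self.n_params),
        )

    def forward(self, desired_return, obs):
        # Generate a flat parameter vector conditioned on the command.
        theta = self.net(desired_return.view(1, 1)).squeeze(0)
        i = 0
        w1 = theta[i:i + self.obs_dim * self.hidden].view(self.hidden, self.obs_dim)
        i += self.obs_dim * self.hidden
        b1 = theta[i:i + self.hidden]
        i += self.hidden
        w2 = theta[i:i + self.hidden * self.act_dim].view(self.act_dim, self.hidden)
        i += self.hidden * self.act_dim
        b2 = theta[i:i + self.act_dim]
        # Run the generated policy on the observation.
        h = torch.tanh(obs @ w1.T + b1)
        return torch.tanh(h @ w2.T + b2)


# Usage: ask the generator for a policy intended to achieve a return of 200.
gen = PolicyGenerator(obs_dim=4, act_dim=1)
action = gen(torch.tensor([200.0]), torch.randn(4))
```

Training such a generator would then alternate between sampling return commands, evaluating the generated policies in the environment, and updating the generator so that commanded and achieved returns match, which is how generalization across commands can be exploited to iteratively reach higher returns.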