We present a hierarchical planning and control framework that enables an agent to perform various tasks and flexibly adapt to new tasks. Rather than learning an individual policy for each particular task, the proposed framework, DISH, distills a hierarchical policy from a set of tasks via representation and reinforcement learning. The framework builds on latent variable models, which represent high-dimensional observations with low-dimensional latent variables. The resulting policy consists of two levels of hierarchy: (i) a planning module that reasons about a sequence of latent intentions leading to an optimistic future, and (ii) a feedback control policy, shared across tasks, that executes the inferred intentions. Because planning is performed in the low-dimensional latent space, the learned policy can immediately be used to solve or adapt to new tasks without additional training. We demonstrate that the proposed framework learns compact representations (3- and 1-dimensional latent states and commands for a humanoid with 197- and 36-dimensional state features and actions) while solving a small number of imitation tasks, and that the resulting policy is directly applicable to other types of tasks, e.g., navigation in cluttered environments. Video: https://youtu.be/HQsQysUWOhg
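To make the two-level structure concrete, the following is a minimal sketch, not the authors' implementation: a planner searches over short sequences of low-dimensional latent commands under a learned latent dynamics model, and a shared feedback policy maps the raw observation plus the current latent command to an action. The dimensions (3-D latent state, 1-D latent command, 197-D observation, 36-D action) come from the abstract; the random-shooting planner, the linear encoder/policy stand-ins, and all function names (`encode`, `latent_dynamics`, `plan_intentions`, `feedback_policy`) are illustrative assumptions.

```python
import numpy as np

OBS_DIM, ACT_DIM = 197, 36   # humanoid state features / actions (from the abstract)
Z_DIM, U_DIM = 3, 1          # latent state / latent command (from the abstract)
HORIZON, N_CANDIDATES = 10, 64

rng = np.random.default_rng(0)
W_enc = rng.standard_normal((Z_DIM, OBS_DIM)) * 0.01    # stand-in for a learned encoder
W_obs = rng.standard_normal((ACT_DIM, OBS_DIM)) * 0.01  # stand-ins for a learned
W_cmd = rng.standard_normal((ACT_DIM, U_DIM))           # low-level feedback policy

def encode(obs):
    """Hypothetical encoder: high-dim observation -> low-dim latent state."""
    return W_enc @ obs

def latent_dynamics(z, u):
    """Hypothetical learned latent dynamics: (z_t, u_t) -> z_{t+1}."""
    return 0.9 * z + 0.1 * np.tanh(u).repeat(Z_DIM)

def plan_intentions(z0, z_goal):
    """Random-shooting planner in latent space: keep the command sequence
    whose predicted latent rollout ends closest to the goal."""
    best_seq, best_cost = None, np.inf
    for _ in range(N_CANDIDATES):
        seq = rng.standard_normal((HORIZON, U_DIM))
        z = z0
        for u in seq:
            z = latent_dynamics(z, u)
        c = float(np.sum((z - z_goal) ** 2))
        if c < best_cost:
            best_seq, best_cost = seq, c
    return best_seq

def feedback_policy(obs, u):
    """Hypothetical shared low-level policy: (observation, latent command) -> action."""
    return np.tanh(W_obs @ obs + W_cmd @ u)

# One control step: plan in the 3-D latent space, execute the first command.
obs = rng.standard_normal(OBS_DIM)
u_seq = plan_intentions(encode(obs), z_goal=np.zeros(Z_DIM))
action = feedback_policy(obs, u_seq[0])
print(action.shape)  # (36,)
```

The key point the sketch illustrates is that the search happens entirely over the 1-D command sequence, so retargeting the planner to a new task only means swapping the latent-space cost, while the shared feedback policy is reused unchanged.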