Effective planning in model-based reinforcement learning (MBRL) and model-predictive control (MPC) relies on the accuracy of the learned dynamics model. In many instances of MBRL and MPC, this model is assumed to be stationary and is periodically re-trained from scratch on state-transition experience collected since the beginning of environment interaction. This implies that the time required to train the dynamics model, and hence the pause required between plan executions, grows linearly with the size of the collected experience. We argue that this is too slow for lifelong robot learning and propose HyperCRL, a method that continually learns the encountered dynamics in a sequence of tasks using task-conditional hypernetworks. Our method has three main attributes: first, it includes dynamics learning sessions that do not revisit training data from previous tasks, so it only needs to store the most recent fixed-size portion of the state-transition experience; second, it uses fixed-capacity hypernetworks to represent non-stationary and task-aware dynamics; third, it outperforms existing continual-learning alternatives that rely on fixed-capacity networks and performs competitively with baselines that remember an ever-increasing coreset of past experience. We show that HyperCRL is effective in continual model-based reinforcement learning in robot locomotion and manipulation scenarios, such as tasks involving pushing and door opening. Our project website with videos is available at https://rvl.cs.toronto.edu/blog/2020/hypercrl
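To make the task-conditional hypernetwork idea concrete, the following is a minimal sketch, not the authors' implementation: it assumes PyTorch, and all names and dimensions (HyperDynamics, STATE_DIM, ACT_DIM, HIDDEN, EMB_DIM) are illustrative. A small, fixed-capacity hypernetwork maps a learned per-task embedding to the parameters of a dynamics MLP that predicts the next state from the current state and action, so the total number of trainable parameters stays constant as tasks accumulate.

import math
import torch
import torch.nn as nn

# Illustrative sizes; the real dimensions depend on the environment.
STATE_DIM, ACT_DIM, HIDDEN, EMB_DIM = 8, 2, 64, 16

class HyperDynamics(nn.Module):
    def __init__(self, num_tasks):
        super().__init__()
        # One trainable embedding per task; the hypernetwork itself has fixed capacity.
        self.task_emb = nn.Embedding(num_tasks, EMB_DIM)
        # Parameter shapes of the target dynamics MLP: (state, action) -> next state.
        self.shapes = [
            (HIDDEN, STATE_DIM + ACT_DIM), (HIDDEN,),   # layer 1 weight, bias
            (STATE_DIM, HIDDEN), (STATE_DIM,),          # layer 2 weight, bias
        ]
        n_params = sum(math.prod(s) for s in self.shapes)
        # Hypernetwork: task embedding -> flat parameter vector of the dynamics MLP.
        self.hnet = nn.Sequential(
            nn.Linear(EMB_DIM, 128), nn.ReLU(), nn.Linear(128, n_params)
        )

    def forward(self, task_id, state, action):
        flat = self.hnet(self.task_emb(task_id))   # generated dynamics parameters
        w1, b1, w2, b2 = self._split(flat)
        x = torch.cat([state, action], dim=-1)
        h = torch.relu(x @ w1.t() + b1)
        return state + (h @ w2.t() + b2)            # predict next state as a delta

    def _split(self, flat):
        # Slice the flat parameter vector into the per-layer weights and biases.
        chunks, i = [], 0
        for s in self.shapes:
            n = math.prod(s)
            chunks.append(flat[i:i + n].view(*s))
            i += n
        return chunks

# Usage sketch: one shared model serves all tasks via the task index.
# model = HyperDynamics(num_tasks=5)
# next_state = model(torch.tensor(0), torch.zeros(STATE_DIM), torch.zeros(ACT_DIM))

In this sketch, only the task embeddings and the hypernetwork weights are trained; continual-learning regularization on the generated parameters (as used by hypernetwork-based methods) is omitted for brevity.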