Deep reinforcement learning (DRL) frameworks are increasingly used to solve high-dimensional continuous-control tasks in robotics. However, due to poor sample efficiency, applying DRL for online learning remains practically infeasible in the robotics domain. One reason is that DRL agents do not leverage the solutions of previous tasks when facing new ones. Recent work on multi-task DRL agents based on successor features has proven quite promising for improving sample efficiency. In this work, we present a new approach that unifies two prior multi-task RL frameworks, SF-GPI and value composition, for the continuous-control domain. We exploit the compositional properties of successor features to compose a policy distribution from a set of primitives without training any new policy. Lastly, to demonstrate the multi-tasking mechanism, we present a new multi-task continuous-control benchmark based on Raisim, which also facilitates large-scale parallelization to accelerate the experiments. Our experimental results in the Pointmass environment show that our multi-task agent achieves single-task performance on par with soft actor-critic (SAC) and can successfully transfer to new, unseen tasks where SAC fails. We provide our code as open source at https://github.com/robot-perception-group/concurrent_composition for the benefit of the community.
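For context, a minimal sketch of the standard successor-feature (SF) and generalized policy improvement (GPI) relations that SF-GPI builds on is given below; this uses the conventional notation from the SF literature and is not necessarily the paper's own formulation:

\[
r(s,a) = \boldsymbol{\phi}(s,a)^{\top}\mathbf{w}, \qquad
Q^{\pi_i}_{\mathbf{w}}(s,a) = \boldsymbol{\psi}^{\pi_i}(s,a)^{\top}\mathbf{w}, \qquad
\pi^{\mathrm{GPI}}(s) \in \arg\max_{a}\,\max_{i}\, Q^{\pi_i}_{\mathbf{w}}(s,a),
\]

where \(\boldsymbol{\phi}(s,a)\) are task-agnostic features, \(\mathbf{w}\) is the task weight vector, and \(\boldsymbol{\psi}^{\pi_i}\) are the successor features of primitive policy \(\pi_i\). In the continuous-control setting considered here, the hard maximization over actions is not directly available, which is why the proposed approach instead composes a policy distribution from the primitives.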