In recent years, the robotics community has made substantial progress in robotic manipulation using deep reinforcement learning (RL). Effective learning of long-horizon tasks, however, remains challenging. Typical RL-based methods approximate long-horizon tasks as Markov decision processes and consider only the current observation (images or other sensor information) as the input state. Such an approximation ignores the fact that the skill sequence also plays a crucial role in long-horizon tasks. In this paper, we take both the observation and the skill sequence into account and propose a skill-sequence-dependent hierarchical policy for solving a typical long-horizon task. The proposed policy consists of a high-level skill policy (conditioned on the skill sequence) and a low-level parameter policy (conditioned on the observation), together with corresponding training methods, which makes learning much more sample-efficient. Experiments in simulation demonstrate that our approach successfully solves a long-horizon task and learns significantly faster than Proximal Policy Optimization (PPO) and task-schema methods.
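The abstract only names the two policy levels; as a hedged illustration, the following minimal Python sketch shows one way such a skill-sequence-dependent factorization could look: a high-level policy that selects the next skill conditioned on the sequence of skills executed so far, and a low-level policy that maps the current observation to continuous parameters for that skill. The skill names, the `env_step` interface, and both policy classes are hypothetical placeholders, not the paper's actual implementation.

```python
import random
from collections import defaultdict
from typing import List, Tuple

# Hypothetical skill library; the paper does not enumerate its skills here.
SKILLS = ["reach", "grasp", "move", "place"]

class SkillPolicy:
    """High-level policy: selects the next skill conditioned on the
    sequence of skills executed so far, not on the raw observation."""

    def __init__(self, epsilon: float = 0.1):
        # Value table keyed by (skill-sequence-so-far, candidate skill).
        self.q = defaultdict(float)
        self.epsilon = epsilon

    def select(self, history: Tuple[str, ...]) -> str:
        if random.random() < self.epsilon:
            return random.choice(SKILLS)  # explore
        return max(SKILLS, key=lambda s: self.q[(history, s)])

class ParameterPolicy:
    """Low-level policy: maps the current observation to continuous
    parameters (e.g., a target pose) for the chosen skill."""

    def __init__(self, param_dim: int = 3):
        self.param_dim = param_dim

    def select(self, skill: str, observation: List[float]) -> List[float]:
        # Placeholder for a learned actor; here it returns random parameters.
        return [random.uniform(-1.0, 1.0) for _ in range(self.param_dim)]

def rollout(env_step, horizon: int = 4) -> float:
    """Runs the two-level policy for up to `horizon` skill steps.
    `env_step(skill, params) -> (observation, reward, done)` is an
    assumed environment interface."""
    high, low = SkillPolicy(), ParameterPolicy()
    history: Tuple[str, ...] = ()
    obs, total_reward = [0.0, 0.0, 0.0], 0.0
    for _ in range(horizon):
        skill = high.select(history)     # depends on the skill sequence
        params = low.select(skill, obs)  # depends on the observation
        obs, reward, done = env_step(skill, params)
        total_reward += reward
        history += (skill,)
        if done:
            break
    return total_reward
```

The design point mirrored in this sketch is the split of inputs: the high-level policy is indexed by the skill history alone, while only the low-level policy consumes the observation, which is the factorization the abstract credits for the improved sample efficiency.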