In this work, we focus on multi-step manipulation tasks that involve long-horizon planning and account for progress reversal. Such tasks interlace high-level reasoning, which identifies the intermediate states that must be attained to achieve the overall task, with low-level reasoning, which decides what actions will yield those states. We propose a sample-efficient Previous Action Conditioned Robotic Manipulation Network (PAC-RoManNet) that learns action-value functions and predicts manipulation action candidates from visual observation of the scene and the action-value predictions of the previous action. We define a Task Progress-based Gaussian (TPG) reward function that computes the reward based on actions that lead to successful motion primitives and on progress towards the overall task goal. To balance exploration and exploitation, we introduce a Loss Adjusted Exploration (LAE) policy that selects actions from the candidates according to a Boltzmann distribution over loss estimates. We demonstrate the effectiveness of our approach by training PAC-RoManNet on several challenging multi-step robotic manipulation tasks in both simulation and the real world. Experimental results show that our method outperforms existing methods and achieves state-of-the-art performance in terms of success rate and action efficiency. Ablation studies show that TPG and LAE are especially beneficial for tasks such as multiple block stacking. Additional experiments on the Ravens-10 benchmark tasks suggest that the proposed PAC-RoManNet generalizes well.
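For intuition only, the following is a minimal sketch of Boltzmann-style action selection over per-candidate loss estimates, in the spirit of the LAE policy summarized above; the temperature parameter `tau`, the sign convention, and the function name `lae_select` are illustrative assumptions rather than the paper's exact formulation.

```python
# Illustrative sketch (not the paper's implementation): sample an action
# candidate from a Boltzmann (softmax) distribution over loss estimates.
import numpy as np

def lae_select(loss_estimates, tau=1.0, rng=None):
    """Return the index of an action candidate sampled in proportion to
    exp(loss / tau), so candidates with higher estimated loss are sampled
    more often, biasing the policy toward exploration (assumed convention)."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(loss_estimates, dtype=float) / tau
    logits -= logits.max()                      # subtract max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return rng.choice(len(probs), p=probs)

# Example: three hypothetical action candidates and their loss estimates.
action_index = lae_select([0.2, 1.5, 0.7], tau=0.5)
```

Lowering `tau` makes the selection greedier with respect to the loss estimates, while raising it flattens the distribution toward uniform random exploration.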