Humans excel at grasping objects through diverse and robust policies, many of which occur with such low probability that exploration-based learning methods can hardly discover and learn them. Inspired by the human learning process, we propose a method that extracts and exploits latent intents from demonstrations and then learns diverse and robust grasping policies through self-exploration. The resulting policy can grasp challenging objects in various environments with an off-the-shelf parallel gripper. The key component is a learned intention estimator, which maps the gripper pose and visual sensory input to a set of sub-intents covering the important phases of the grasping movement. These sub-intents can be used to build an intrinsic reward that guides policy learning. The learned policy achieves remarkable zero-shot transfer from simulation to the real world while remaining robust to states never encountered during training, to novel objects such as protractors and user manuals, and to novel environments such as a cluttered conveyor.
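The abstract does not specify how sub-intents become an intrinsic reward. One plausible construction, sketched below purely as an illustration and not as the authors' method, treats the sub-intents as ordered grasp phases and uses standard potential-based reward shaping: the potential of a state is the expected phase index under the estimator's output, and the reward is `gamma * phi(s') - phi(s)`. The phase names, the `GraspState` fields, and `toy_estimator` (a hand-written stand-in for the learned intention estimator) are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical ordered sub-intent phases of a grasp (illustrative names only).
SUB_INTENTS = ["approach", "align", "close", "lift"]


@dataclass
class GraspState:
    # Hypothetical state: gripper pose (x, y, z, yaw) and a visual feature vector.
    gripper_pose: Tuple[float, float, float, float]
    visual: List[float]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_estimator(state: GraspState) -> List[float]:
    """Stand-in for a learned intention estimator (hypothetical).

    Treats visual[0] as a scalar 'progress' cue and returns a probability
    distribution over SUB_INTENTS peaked at the nearest phase index.
    """
    progress = state.visual[0]
    logits = [-4.0 * (progress - i) ** 2 for i in range(len(SUB_INTENTS))]
    return softmax(logits)


def phase_potential(probs: Sequence[float]) -> float:
    """Expected phase index in [0, K-1] under the sub-intent distribution."""
    return sum(i * p for i, p in enumerate(probs))


def intrinsic_reward(
    estimator: Callable[[GraspState], List[float]],
    state: GraspState,
    next_state: GraspState,
    gamma: float = 0.99,
) -> float:
    """Potential-based shaping: positive when the grasp advances through phases."""
    phi = phase_potential(estimator(state))
    phi_next = phase_potential(estimator(next_state))
    return gamma * phi_next - phi


if __name__ == "__main__":
    start = GraspState(gripper_pose=(0.0, 0.0, 0.3, 0.0), visual=[0.0])
    lifted = GraspState(gripper_pose=(0.0, 0.0, 0.2, 0.0), visual=[3.0])
    print(intrinsic_reward(toy_estimator, start, lifted))  # positive: progress made
    print(intrinsic_reward(toy_estimator, lifted, start))  # negative: regression
```

Potential-based shaping is chosen here because it is known to leave the optimal policy of the underlying task unchanged, so the phase signal only speeds up exploration; a real system would replace `toy_estimator` with the trained network.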