Learning to solve precision-based manipulation tasks from visual feedback using Reinforcement Learning (RL) could drastically reduce the engineering efforts required by traditional robot systems. However, performing fine-grained motor control from visual inputs alone is challenging, especially with a static third-person camera as often used in previous work. We propose a setting for robotic manipulation in which the agent receives visual feedback from both a third-person camera and an egocentric camera mounted on the robot's wrist. While the third-person camera is static, the egocentric camera enables the robot to actively control its vision to aid in precise manipulation. To fuse visual information from both cameras effectively, we additionally propose to use Transformers with a cross-view attention mechanism that models spatial attention from one view to another (and vice-versa), and use the learned features as input to an RL policy. Our method improves learning over strong single-view and multi-view baselines, and successfully transfers to a set of challenging manipulation tasks on a real robot with uncalibrated cameras, no access to state information, and a high degree of task variability. In a hammer manipulation task, our method succeeds in 75% of trials versus 38% and 13% for multi-view and single-view baselines, respectively.
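To make the cross-view attention idea concrete, below is a minimal illustrative sketch (not the authors' exact implementation) of how features from the third-person and egocentric cameras could attend to one another before being pooled into a policy input. All names, dimensions, and the symmetric fusion scheme are assumptions for illustration only.

```python
import torch
import torch.nn as nn


class CrossViewAttention(nn.Module):
    """Illustrative sketch: tokens from one view query the tokens of the other view."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens_q: torch.Tensor, tokens_kv: torch.Tensor) -> torch.Tensor:
        # tokens_q, tokens_kv: (batch, num_patches, dim) patch features from two camera views.
        fused, _ = self.attn(query=tokens_q, key=tokens_kv, value=tokens_kv)
        # Residual connection plus normalization, a common Transformer design choice.
        return self.norm(tokens_q + fused)


if __name__ == "__main__":
    # Hypothetical usage: fuse the two views in both directions, then pool for the RL policy.
    dim = 128
    third_person = torch.randn(4, 49, dim)  # e.g. 7x7 patch tokens from the static camera
    egocentric = torch.randn(4, 49, dim)    # patch tokens from the wrist-mounted camera

    cross = CrossViewAttention(dim)
    fused_third = cross(third_person, egocentric)  # third-person queries attend to egocentric
    fused_ego = cross(egocentric, third_person)    # and vice-versa

    # Mean-pool each view and concatenate to form the policy observation vector.
    policy_input = torch.cat([fused_third.mean(dim=1), fused_ego.mean(dim=1)], dim=-1)
    print(policy_input.shape)  # torch.Size([4, 256])
```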