Humans use all of their senses to accomplish different tasks in everyday activities. In contrast, existing work on robotic manipulation mostly relies on one, or occasionally two, modalities, such as vision and touch. In this work, we systematically study how visual, auditory, and tactile perception can jointly help robots solve complex manipulation tasks. We build a robot system that can see with a camera, hear with a contact microphone, and feel with a vision-based tactile sensor, fusing all three sensory modalities with a self-attention model. Results on two challenging tasks, dense packing and pouring, demonstrate the necessity and power of multisensory perception for robotic manipulation: vision displays the global status of the robot but often suffers from occlusion, audio provides immediate feedback on key moments that may not even be visible, and touch offers precise local geometry for decision making. Leveraging all three modalities, our robotic system significantly outperforms prior methods.
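To make the fusion step concrete, the sketch below shows one plausible way to combine camera, contact-microphone, and tactile features with self-attention. It is not the authors' implementation: the class name, encoder architectures, feature dimension, input shapes, and action space are all illustrative assumptions; only the overall idea of encoding each modality into a token and fusing the tokens with a self-attention model reflects the text above.

```python
# Minimal sketch (assumed details, not the paper's architecture) of fusing
# vision, audio, and touch with self-attention over per-modality tokens.
import torch
import torch.nn as nn

class MultisensoryFusion(nn.Module):
    def __init__(self, feat_dim=128, num_actions=6):
        super().__init__()
        # Per-modality encoders: small CNNs for the camera and tactile images,
        # a 1-D CNN for the contact-microphone signal (all shapes are assumptions).
        self.vision_enc = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim))
        self.touch_enc = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim))
        self.audio_enc = nn.Sequential(
            nn.Conv1d(1, 32, 9, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(32, feat_dim))
        # Learned modality embeddings mark which token came from which sensor.
        self.modality_emb = nn.Parameter(torch.zeros(3, feat_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.policy_head = nn.Linear(feat_dim, num_actions)

    def forward(self, rgb, tactile, audio):
        # rgb, tactile: (B, 3, H, W) images; audio: (B, 1, T) waveform/spectrogram slice.
        tokens = torch.stack([
            self.vision_enc(rgb),
            self.touch_enc(tactile),
            self.audio_enc(audio)], dim=1)           # (B, 3, feat_dim)
        tokens = tokens + self.modality_emb          # add modality identity
        fused = self.fusion(tokens)                  # self-attention across modalities
        return self.policy_head(fused.mean(dim=1))   # pooled features -> action logits

# Example forward pass with dummy sensor readings.
model = MultisensoryFusion()
logits = model(torch.randn(2, 3, 64, 64),   # camera frames
               torch.randn(2, 3, 64, 64),   # vision-based tactile images
               torch.randn(2, 1, 16000))    # audio clips
print(logits.shape)  # torch.Size([2, 6])
```

Because each modality contributes a separate token, the self-attention layers can weight vision, audio, and touch differently at different moments, which is the behavior the abstract attributes to the three senses (global view, key-moment feedback, local geometry).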