Multimodal perception for dexterous manipulation

Humans usually perceive the world multimodally: vision, touch, and sound are used together to understand the surroundings from different dimensions. Combining these senses yields a synergistic effect, so that learning is more effective than with each sense used separately. In robotics, vision and touch are two key senses for dexterous manipulation. Vision typically provides appearance features such as shape and color, whereas touch provides local information such as friction and texture. Because visual and tactile sensing are complementary, it is desirable to combine them for synergistic perception and manipulation. Much research has investigated multimodal perception with vision and touch, including cross-modal learning, 3D reconstruction, and multimodal translation. Specifically, we propose a cross-modal sensory data generation framework for translation between vision and touch that is able to generate realistic pseudo data. This cross-modal translation method can make up for inaccessible data, helping us learn an object's properties from different views. Recently, the attention mechanism has become popular in both visual and tactile perception. We propose a spatio-temporal attention model for tactile texture recognition that takes both spatial features and the time dimension into account. The proposed method not only attends to the salient features within the spatial features, but also models the temporal correlation across time. The clear improvement demonstrates the effectiveness of our selective attention mechanism. The spatio-temporal attention method has potential in many applications, such as grasping, recognition, and multimodal perception.
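To make the idea of attending over space and then over time concrete, the following is a minimal NumPy sketch, not the model proposed here. It assumes a tactile sequence has already been encoded as an array of shape (T, N, D), with T frames, N spatial locations, and D channels, and it uses simple dot-product scoring with hypothetical learned vectors `w_spatial` and `w_temporal`. The function name, shapes, and scoring scheme are all illustrative assumptions.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def spatio_temporal_attention(features, w_spatial, w_temporal):
    """Pool a (T, N, D) feature sequence into a single D-dimensional vector.

    features   : (T, N, D) array of per-frame, per-location features (assumed input).
    w_spatial  : (D,) hypothetical scoring vector for spatial locations.
    w_temporal : (D,) hypothetical scoring vector for frames.
    """
    # Spatial attention: score every location, normalise within each frame.
    spatial_scores = features @ w_spatial                # (T, N)
    spatial_weights = softmax(spatial_scores, axis=1)    # each row sums to 1
    frame_vectors = (spatial_weights[..., None] * features).sum(axis=1)  # (T, D)

    # Temporal attention: score every frame, normalise over the sequence.
    temporal_scores = frame_vectors @ w_temporal         # (T,)
    temporal_weights = softmax(temporal_scores, axis=0)  # sums to 1
    sequence_vector = (temporal_weights[:, None] * frame_vectors).sum(axis=0)  # (D,)

    return sequence_vector, spatial_weights, temporal_weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(5, 16, 8))  # 5 frames, 16 locations, 8 channels
    vec, sw, tw = spatio_temporal_attention(
        feats, rng.normal(size=8), rng.normal(size=8)
    )
    print(vec.shape, sw.shape, tw.shape)
```

In a trained model the pooled vector would feed a texture classifier and the scoring parameters would be learned end to end; the returned weights can be inspected to see which locations and frames were emphasised.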
