Learning representations in the joint domain of vision and touch can improve manipulation dexterity and robustness and reduce sample complexity by exploiting mutual information and complementary cues between the two modalities. Here, we present Visuo-Tactile Transformers (VTTs), a novel multimodal representation learning approach suited for model-based reinforcement learning and planning. Our approach extends the Vision Transformer \cite{dosovitskiy2021image} to handle visuo-tactile feedback. Specifically, VTT uses tactile feedback together with self- and cross-modal attention to build latent heatmap representations that focus attention on important task features in the visual domain. We demonstrate the efficacy of VTT for representation learning with a comparative evaluation against baselines on four simulated robot tasks and one real-world block pushing task. We conduct an ablation study over the components of VTT to highlight the importance of cross-modality in representation learning.
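As a rough illustration of the self- and cross-modal attention mechanism named above, the following PyTorch sketch fuses visual patch tokens with tactile tokens. It is a minimal stand-in under stated assumptions, not the VTT implementation: the \texttt{VisuoTactileFusion} module, token counts, embedding size, and head count are all illustrative choices.

\begin{verbatim}
# Minimal sketch (assumed, not the authors' code) of self- and
# cross-modal attention over visual patch tokens and tactile tokens.
import torch
import torch.nn as nn

class VisuoTactileFusion(nn.Module):
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, visual: torch.Tensor,
                tactile: torch.Tensor) -> torch.Tensor:
        # visual: (B, N_patches, dim) image patch embeddings
        # tactile: (B, N_taxels, dim) tactile embeddings
        tokens = torch.cat([visual, tactile], dim=1)
        # Self-attention mixes information within and across modalities.
        tokens = self.norm1(tokens + self.self_attn(tokens, tokens,
                                                    tokens)[0])
        vis = tokens[:, : visual.shape[1]]
        # Cross-modal attention: visual tokens query the tactile tokens,
        # so touch reweights which image regions the latent attends to.
        attended, weights = self.cross_attn(vis, tactile, tactile)
        # `weights` (B, N_patches, N_taxels), averaged over taxels and
        # reshaped to the patch grid, gives a latent attention heatmap.
        return self.norm2(vis + attended)

# Example: 196 image patches (a 14x14 grid) and 6 tactile tokens.
fusion = VisuoTactileFusion()
v = torch.randn(2, 196, 128)
t = torch.randn(2, 6, 128)
out = fusion(v, t)  # (2, 196, 128)
\end{verbatim}

In this sketch the cross-modal step is what produces a heatmap over image patches, consistent with the abstract's description of latent heatmap representations; how VTT actually parameterizes and trains this fusion is specified in the paper itself.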