Tactile representation learning (TRL) equips robots with the ability to leverage touch information, boosting performance in tasks such as environment perception and object manipulation. However, the heterogeneity of tactile sensors has led to many sensor- and task-specific learning approaches. This limits the efficacy of existing tactile datasets and the subsequent generalisability of any learning outcome. In this work, we investigate the applicability of vision foundation models to sensor-agnostic TRL, via a simple yet effective transformation technique that feeds heterogeneous sensor readouts into the model. Our approach recasts TRL as a computer vision (CV) problem, which permits the application of various CV techniques to tackle TRL-specific challenges. We evaluate our approach on multiple benchmark tasks, using datasets collected from four different tactile sensors. Empirically, we demonstrate significant improvements in task performance, model robustness, and cross-sensor and cross-task knowledge transferability with limited data requirements.
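As an illustration only (not the paper's exact pipeline), the following minimal sketch shows the general idea of recasting TRL as a CV problem: heterogeneous tactile readouts are mapped onto a common image format and passed through a pretrained vision backbone as a sensor-agnostic feature extractor. The helper `readout_to_image`, the example sensor shapes, and the choice of a frozen ResNet-18 backbone are assumptions made for this sketch.

```python
# Minimal sketch (under stated assumptions): map heterogeneous tactile
# readouts to a common 3-channel image format, then extract features with a
# pretrained vision model. Sensor names, shapes, and backbone are illustrative.
import numpy as np
import torch
import torch.nn.functional as F
from torchvision.models import resnet18, ResNet18_Weights


def readout_to_image(readout: np.ndarray, size: int = 224) -> torch.Tensor:
    """Map a tactile readout (H x W or H x W x C) to a normalised
    3-channel image tensor of shape (3, size, size)."""
    x = torch.as_tensor(readout, dtype=torch.float32)
    if x.ndim == 2:                       # e.g. a low-resolution taxel pressure grid
        x = x.unsqueeze(-1)
    x = x.permute(2, 0, 1)                # (C, H, W)
    # Min-max normalise per sample so sensors with different ranges are comparable.
    x = (x - x.min()) / (x.max() - x.min() + 1e-8)
    if x.shape[0] == 1:                   # replicate to 3 channels for RGB backbones
        x = x.repeat(3, 1, 1)
    elif x.shape[0] > 3:
        x = x[:3]
    # Resize to the backbone's expected input resolution.
    x = F.interpolate(x.unsqueeze(0), size=(size, size),
                      mode="bilinear", align_corners=False).squeeze(0)
    return x


# Hypothetical readouts from two different sensors: a taxel array and a
# camera-based (GelSight-style) optical tactile image.
taxel_grid = np.random.rand(4, 4)             # 4x4 pressure values
optical_img = np.random.rand(240, 320, 3)     # RGB tactile image

batch = torch.stack([readout_to_image(taxel_grid),
                     readout_to_image(optical_img)])

# Frozen pretrained vision backbone used as a sensor-agnostic feature extractor.
backbone = resnet18(weights=ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()
with torch.no_grad():
    features = backbone(batch)                # (2, 512) tactile embeddings
print(features.shape)
```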