Handheld grippers are increasingly used to collect human demonstrations due to their ease of deployment and versatility. However, most existing designs lack tactile sensing, despite the critical role of tactile feedback in precise manipulation. We present a portable, lightweight gripper with integrated tactile sensors that enables synchronized collection of visual and tactile data in diverse, in-the-wild, real-world settings. Building on this hardware, we propose a cross-modal representation learning framework that integrates visual and tactile signals while preserving their distinct characteristics. The learning procedure yields interpretable representations that consistently focus on contact regions relevant to physical interaction. When used for downstream manipulation tasks, these representations enable more efficient and effective policy learning, supporting precise robotic manipulation based on multimodal feedback. We validate our approach on fine-grained tasks such as test-tube insertion and pipette-based fluid transfer, demonstrating improved accuracy and robustness under external disturbances. Our project page is available at https://binghao-huang.github.io/touch_in_the_wild/.
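To make the cross-modal idea concrete, below is a minimal sketch of an encoder that fuses synchronized visual and tactile observations into a joint representation while keeping separate per-modality branches. It assumes PyTorch and hypothetical input shapes, layer sizes, and class names (`CrossModalEncoder`, a 16x16 taxel grid, 96x96 RGB frames); the abstract does not specify the actual architecture, sensor layout, or training objective, so this is illustrative only.

```python
# Hypothetical sketch of a vision + tactile cross-modal encoder.
# All shapes, layer sizes, and names are assumptions, not the paper's design.
import torch
import torch.nn as nn


class CrossModalEncoder(nn.Module):
    """Encodes synchronized RGB and tactile inputs into one representation."""

    def __init__(self, embed_dim: int = 256):
        super().__init__()
        # Vision branch: small CNN over RGB frames (placeholder backbone).
        self.vision_encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        # Tactile branch: MLP over a flattened taxel grid (layout assumed).
        self.tactile_encoder = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 16, 128), nn.ReLU(),
            nn.Linear(128, embed_dim),
        )
        # Fusion head producing the joint representation a policy could consume.
        self.fusion = nn.Sequential(
            nn.Linear(2 * embed_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, rgb: torch.Tensor, tactile: torch.Tensor) -> torch.Tensor:
        z_v = self.vision_encoder(rgb)       # (B, embed_dim)
        z_t = self.tactile_encoder(tactile)  # (B, embed_dim)
        return self.fusion(torch.cat([z_v, z_t], dim=-1))


if __name__ == "__main__":
    model = CrossModalEncoder()
    rgb = torch.randn(4, 3, 96, 96)      # synchronized camera frames
    tactile = torch.randn(4, 1, 16, 16)  # synchronized tactile readings
    print(model(rgb, tactile).shape)     # torch.Size([4, 256])
```

In this sketch, keeping separate encoders before fusion mirrors the abstract's goal of integrating the two modalities while preserving their distinct characteristics; the actual method may fuse them differently.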