Learning to produce contact-rich, dynamic behaviors from raw sensory data has been a longstanding challenge in robotics. Prominent approaches primarily focus on visual or tactile sensing; unfortunately, the former fails to capture high-frequency interactions, while the latter can be too delicate for large-scale data collection. In this work, we propose a data-centric approach to dynamic manipulation that uses an often-ignored source of information: sound. We first collect a dataset of 25k interaction-sound pairs across five dynamic tasks using commodity contact microphones. Then, given this data, we leverage self-supervised learning to accelerate behavior prediction from sound. Our experiments indicate that this self-supervised 'pretraining' is crucial to achieving high performance: it yields a 34.5% lower MSE than plain supervised learning and a 54.3% lower MSE than visual training. Importantly, we find that when asked to generate desired sound profiles, online rollouts of our models on a UR10 robot produce dynamic behavior that achieves an average improvement of 11.5% over supervised learning on audio-similarity metrics.
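To make the two-stage recipe in the abstract concrete, the sketch below shows one plausible instantiation: self-supervised pretraining of an audio encoder on contact-microphone spectrograms, followed by supervised regression of behavior parameters scored with MSE. This is only a minimal illustration, not the authors' implementation; the SimCLR-style InfoNCE objective, the toy augmentation, the network sizes, the spectrogram shape, and the action dimension are all assumptions not specified in the abstract.

```python
# Minimal sketch (not the authors' code) of the pipeline described in the abstract:
# (1) self-supervised pretraining of an audio encoder on contact-microphone spectrograms,
# (2) supervised behavior prediction from the pretrained features, evaluated with MSE.
# The contrastive objective, shapes, and action dimension below are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioEncoder(nn.Module):
    """Small CNN over log-mel spectrograms (assumed 1 x 64 x 64) -> feature vector."""
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(128, feat_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(self.conv(x).flatten(1))


def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Contrastive loss between two augmented views of the same audio clip."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau                 # (B, B) similarity matrix
    targets = torch.arange(z1.size(0))         # positives lie on the diagonal
    return F.cross_entropy(logits, targets)


def augment(spec: torch.Tensor) -> torch.Tensor:
    """Toy augmentation: additive noise; a real pipeline would mask time/frequency bands."""
    return spec + 0.01 * torch.randn_like(spec)


if __name__ == "__main__":
    encoder = AudioEncoder()
    opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)

    # Stage 1: self-supervised 'pretraining' on unlabeled audio clips.
    for _ in range(10):                        # a few illustrative steps
        specs = torch.randn(32, 1, 64, 64)     # stand-in for contact-mic spectrograms
        loss = info_nce(encoder(augment(specs)), encoder(augment(specs)))
        opt.zero_grad(); loss.backward(); opt.step()

    # Stage 2: supervised behavior prediction from sound, trained and reported with MSE.
    action_dim = 6                             # assumed size of the behavior parameters
    head = nn.Linear(128, action_dim)
    opt2 = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-4)
    specs, actions = torch.randn(32, 1, 64, 64), torch.randn(32, action_dim)
    mse = F.mse_loss(head(encoder(specs)), actions)
    opt2.zero_grad(); mse.backward(); opt2.step()
    print(f"behavior-prediction MSE: {mse.item():.4f}")
```

Under these assumptions, the abstract's comparison corresponds to training the stage-2 regressor with and without the stage-1 pretraining and comparing the resulting MSE.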