Reliable robotic grasping, especially of deformable objects such as fruits, remains challenging due to underactuated contact interactions with the gripper and unknown object dynamics and geometries. In this study, we propose a Transformer-based robotic grasping framework for rigid grippers that leverages tactile and visual information for safe object grasping. Specifically, the Transformer models learn physical feature embeddings from sensor feedback while performing two pre-defined exploratory actions (pinching and sliding), and predict the grasping outcome for a given grasping strength through a multilayer perceptron (MLP). Using these predictions, the framework infers a safe grasping strength for the gripper. Compared with convolutional recurrent networks, the Transformer models can capture long-term dependencies across image sequences and process spatial and temporal features simultaneously. We first benchmark the Transformer models on a public dataset for slip detection. We then show that they outperform a CNN+LSTM model in terms of grasping accuracy and computational efficiency. We also collect a new fruit grasping dataset and conduct online grasping experiments with the proposed framework on both seen and unseen fruits. Our code and dataset are publicly available on GitHub.
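To make the described pipeline concrete, the sketch below shows one plausible way to wire a Transformer encoder over frame sequences from the two exploratory actions, with a grasping-strength token and an MLP head that predicts the grasp outcome, followed by a simple search for a safe strength. This is not the authors' implementation; all module names, dimensions, and the backbone choice are illustrative assumptions.

```python
# Hypothetical sketch of the abstract's pipeline (not the paper's code).
import torch
import torch.nn as nn

class GraspOutcomePredictor(nn.Module):
    def __init__(self, feat_dim=256, n_heads=4, n_layers=4, seq_len=32):
        super().__init__()
        # Per-frame embedding of tactile/visual images (assumed small CNN backbone).
        self.frame_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(32 * 4 * 4, feat_dim),
        )
        self.pos_embed = nn.Parameter(torch.zeros(1, seq_len + 1, feat_dim))
        self.strength_embed = nn.Linear(1, feat_dim)  # grasping-strength token
        layer = nn.TransformerEncoderLayer(feat_dim, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        # MLP head predicting the probability of a safe grasp outcome.
        self.head = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, frames, strength):
        # frames: (B, T, 3, H, W) from pinching + sliding; strength: (B, 1)
        b, t = frames.shape[:2]
        tokens = self.frame_encoder(frames.flatten(0, 1)).view(b, t, -1)
        tokens = torch.cat([self.strength_embed(strength)[:, None], tokens], dim=1)
        tokens = tokens + self.pos_embed[:, : t + 1]
        fused = self.transformer(tokens)
        return torch.sigmoid(self.head(fused[:, 0]))  # P(safe grasp)

def infer_safe_strength(model, frames, candidates, threshold=0.5):
    """Pick the smallest candidate strength predicted to yield a safe grasp."""
    with torch.no_grad():
        for s in sorted(candidates):
            p = model(frames, torch.tensor([[s]], dtype=torch.float32))
            if p.item() > threshold:
                return s
    return None
```

Under these assumptions, the strength token lets a single forward pass score each candidate grasping strength against the same exploratory observations, so inferring a safe strength reduces to evaluating a small set of candidates and taking the weakest one the model deems safe.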