Reliable robotic grasping, especially of deformable objects such as fruits, remains a challenging task due to the under-actuated contact interactions between gripper and object and to unknown object dynamics and geometries. In this study, we propose a Transformer-based robotic grasping framework for rigid grippers that leverages tactile and visual information for safe object grasping. Specifically, the Transformer models learn physical feature embeddings from sensor feedback gathered while performing two pre-defined explorative actions (pinching and sliding), and predict the grasping outcome for a given grasping strength through a multilayer perceptron (MLP). Using these predictions, the gripper infers a safe grasping strength. Compared with convolutional neural network (CNN) based recurrent models, the Transformer models can capture long-term dependencies across image sequences and process spatial-temporal features simultaneously. We first benchmark the Transformer models on a public dataset for slip detection. We then show that the Transformer models outperform a CNN+LSTM model in terms of grasping accuracy and computational efficiency. We also collect a fruit grasping dataset and conduct online grasping experiments with the proposed framework on both seen and unseen fruits. Our code and dataset are publicly available on GitHub.
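The pipeline described above can be illustrated with a minimal sketch, assuming hypothetical module names, feature dimensions, and a simple strength-sweep inference rule (this is not the released implementation): a Transformer encoder embeds the tactile/visual frame features gathered during the pinching and sliding actions, and an MLP maps the pooled embedding plus a candidate grasping strength to a grasp-outcome probability.

```python
# Minimal PyTorch sketch (hypothetical names/shapes, not the authors' code).
import torch
import torch.nn as nn

class GraspOutcomePredictor(nn.Module):
    def __init__(self, feat_dim=256, n_heads=4, n_layers=2):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 1, 128), nn.ReLU(),
            nn.Linear(128, 1), nn.Sigmoid())  # probability that the grasp is safe

    def forward(self, frame_feats, strength):
        # frame_feats: (batch, seq_len, feat_dim) features from explorative actions
        # strength:    (batch, 1) candidate grasping strength
        emb = self.encoder(frame_feats).mean(dim=1)         # pooled sequence embedding
        return self.mlp(torch.cat([emb, strength], dim=-1))

# Inference: sweep candidate strengths and keep the smallest one predicted safe.
model = GraspOutcomePredictor()
feats = torch.randn(1, 20, 256)                              # dummy sensor features
candidates = torch.linspace(0.1, 1.0, 10).unsqueeze(1)       # (10, 1) strengths
probs = model(feats.expand(10, -1, -1), candidates).squeeze(1)
safe = candidates[(probs > 0.5).nonzero(as_tuple=True)[0]]
safe_strength = safe.min() if safe.numel() > 0 else candidates.max()
```

Mean-pooling the encoder output and thresholding at 0.5 are illustrative choices; the actual aggregation and the rule for selecting a safe strength would follow the paper's implementation.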