Planar grasp detection is one of the most fundamental tasks in robotic manipulation, and recent progress in consumer-grade RGB-D sensors makes it possible to capture richer features from both the texture and shape modalities. However, depth maps generally have lower quality and much stronger noise than RGB images, making it challenging to acquire the grasp depth and to fuse multi-modal cues. To address these two issues, this paper proposes a novel learning-based approach to RGB-D grasp detection, namely the Depth Guided Cross-modal Attention Network (DGCAN). To better leverage the geometric information recorded in the depth channel, a complete 6-dimensional rectangle representation is adopted, which dedicatedly considers the grasp depth in addition to the attributes defined in the common 5-dimensional one. Predicting this extra grasp depth substantially strengthens feature learning, thereby leading to more accurate results. Moreover, to reduce the negative impact caused by the discrepancy in data quality between the two modalities, a Local Cross-modal Attention (LCA) module is designed, where the depth features are refined according to cross-modal relations and concatenated to the RGB ones for more thorough fusion. Extensive simulation and physical evaluations are conducted, and the experimental results highlight the superiority of the proposed approach.
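To make the two representations concrete, the sketch below contrasts the common 5-dimensional grasp rectangle with the 6-dimensional one described above. The class and field names and the parameterization details (e.g., angle in radians) are illustrative assumptions, not the paper's exact definitions.

```python
from dataclasses import dataclass


@dataclass
class Grasp5D:
    """Common 5-dimensional planar grasp rectangle:
    image-plane center (x, y), gripper opening width w,
    rectangle height h, and in-plane rotation theta (radians).
    Field names are assumptions for illustration."""
    x: float
    y: float
    w: float
    h: float
    theta: float


@dataclass
class Grasp6D(Grasp5D):
    """6-dimensional representation used by DGCAN: the 5-D
    rectangle plus a dedicated grasp depth d along the
    camera's approach axis (hypothetical field name)."""
    d: float
```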
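The LCA module can likewise be sketched in PyTorch. This is a minimal interpretation under stated assumptions, not the paper's implementation: the module name, its constructor arguments, and the choice of depth-as-query attention are all hypothetical, and full attention over the feature map stands in for the paper's local windowing.

```python
import torch
import torch.nn as nn


class LocalCrossModalAttention(nn.Module):
    """Minimal sketch of an LCA-style block (assumed layout):
    depth features attend to RGB features so that noisy depth
    responses are re-weighted by cross-modal relations, and the
    refined depth map is then concatenated to the RGB one."""

    def __init__(self, channels: int, num_heads: int = 4):
        # channels must be divisible by num_heads
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        # rgb, depth: (B, C, H, W) feature maps from modality-specific encoders
        b, c, h, w = rgb.shape
        rgb_seq = rgb.flatten(2).transpose(1, 2)      # (B, H*W, C)
        depth_seq = depth.flatten(2).transpose(1, 2)  # (B, H*W, C)
        # Depth queries attend to RGB keys/values; the residual keeps
        # the original depth signal. (The paper restricts attention to
        # local regions; global attention is used here for brevity.)
        refined, _ = self.attn(depth_seq, rgb_seq, rgb_seq)
        depth_seq = self.norm(depth_seq + refined)
        depth_ref = depth_seq.transpose(1, 2).reshape(b, c, h, w)
        return torch.cat([rgb, depth_ref], dim=1)     # (B, 2C, H, W)
```

A typical call would be `fused = lca(rgb_feat, depth_feat)` with two (B, C, H, W) feature maps. Concatenation rather than addition preserves the unrefined RGB stream, matching the abstract's description of the refined depth features being concatenated to the RGB ones.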