Robots operating in human-centered environments, such as retail stores, restaurants, and households, are often required to distinguish between similar objects in different contexts with a high degree of accuracy. However, fine-grained object recognition remains a challenge in robotics due to high intra-category dissimilarity and low inter-category dissimilarity. In addition, the limited number of fine-grained 3D datasets makes it difficult to address this issue effectively. In this paper, we propose a hybrid multi-modal approach combining Vision Transformers (ViT) and Convolutional Neural Networks (CNN) to improve the performance of fine-grained visual classification (FGVC). To address the shortage of FGVC 3D datasets, we generated two synthetic datasets. The first consists of 20 restaurant-related categories with a total of 100 instances, while the second contains 120 shoe instances. Our approach was evaluated on both datasets, and the results indicate that it outperforms both CNN-only and ViT-only baselines, achieving recognition accuracies of 94.50% and 93.51% on the restaurant and shoe datasets, respectively. Additionally, we have made our FGVC RGB-D datasets available to the research community to enable further experimentation and advancement. Furthermore, we successfully integrated the proposed method into a robot framework and demonstrated its potential as a fine-grained perception tool in both simulated and real-world robotic scenarios.
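The abstract does not specify how the CNN and ViT branches are combined. As a purely illustrative sketch (not the paper's actual fusion strategy), a common baseline for hybrid multi-modal classifiers is late, score-level fusion, where each branch produces class probabilities that are averaged with a mixing weight:

```python
# Hypothetical sketch of late (score-level) fusion of a CNN branch and a
# ViT branch. The paper's exact architecture and fusion weights are not
# given in the abstract; everything below is illustrative.
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fuse_predictions(cnn_logits, vit_logits, alpha=0.5):
    """Weighted average of the two branches' class probabilities.

    alpha is an assumed mixing coefficient (0.5 = equal weight).
    """
    return alpha * softmax(cnn_logits) + (1 - alpha) * softmax(vit_logits)

# Toy example: one image, 20 classes (matching the restaurant dataset's
# 20 categories), with randomly generated logits standing in for the
# two backbones' outputs.
rng = np.random.default_rng(0)
num_classes = 20
cnn_logits = rng.normal(size=(1, num_classes))
vit_logits = rng.normal(size=(1, num_classes))

probs = fuse_predictions(cnn_logits, vit_logits)
predicted_class = int(probs.argmax(axis=-1)[0])
```

Because each branch's output is converted to a probability distribution before averaging, the fused output is itself a valid distribution, which makes this kind of fusion easy to drop into an existing classification pipeline.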