Despite decades of research, understanding human manipulation activities remains one of the most attractive and challenging topics in computer vision and robotics. Recognition and prediction of observed human manipulation actions are rooted in applications such as human-robot interaction and robot learning from demonstration. The current research trend relies heavily on advanced convolutional neural networks that process structured Euclidean data, such as RGB camera images. These networks, however, incur immense computational cost to process such high-dimensional raw data. In contrast to related work, we introduce a deep graph autoencoder that jointly learns to recognize and predict manipulation tasks from symbolic scene graphs instead of structured Euclidean data. Our network has a variational autoencoder structure with two branches: one identifies the type of the input graph, and the other predicts future graphs. The input to the proposed network is a set of semantic graphs that store the spatial relations between the subjects and objects in the scene. The network output is a label set representing the detected and predicted class types. We benchmark our model against state-of-the-art methods on two datasets, MANIAC and MSRC-9, and show that it achieves better performance. We also release our source code at https://github.com/gamzeakyol/GNet.
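To make the two-branch idea concrete, the sketch below shows a minimal variational graph autoencoder with a shared encoder and two heads, one for action classification and one for predicting the node features of a future graph. This is an illustrative toy in plain PyTorch under assumed dimensions, not the released GNet implementation; all class and layer names here are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchGraphVAE(nn.Module):
    """Illustrative sketch (not the authors' code): encode a scene graph
    (node features X, adjacency A) into a latent code z, then
    (a) classify the observed manipulation and (b) decode a future graph."""

    def __init__(self, in_dim, hid_dim, lat_dim, n_classes):
        super().__init__()
        self.enc = nn.Linear(in_dim, hid_dim)          # shared graph encoder layer
        self.mu = nn.Linear(hid_dim, lat_dim)          # variational mean
        self.logvar = nn.Linear(hid_dim, lat_dim)      # variational log-variance
        self.cls_head = nn.Linear(lat_dim, n_classes)  # branch 1: recognition
        self.dec = nn.Linear(hid_dim + lat_dim, in_dim)  # branch 2: future graph

    def forward(self, x, adj):
        # One round of neighborhood aggregation: H = ReLU(A X W)
        h = F.relu(self.enc(adj @ x))
        g = h.mean(dim=0)                              # graph-level mean pooling
        mu, logvar = self.mu(g), self.logvar(g)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        logits = self.cls_head(z)                      # predicted action class scores
        # Predict future node features from per-node states plus the latent code
        future = self.dec(torch.cat([h, z.expand(h.size(0), -1)], dim=-1))
        return logits, future

# Usage with a toy 5-node scene graph (4-dim node features, 6 action classes)
model = TwoBranchGraphVAE(in_dim=4, hid_dim=8, lat_dim=3, n_classes=6)
x, adj = torch.randn(5, 4), torch.eye(5)
logits, future = model(x, adj)
```

In a real system the linear encoder would be replaced by stacked graph-convolution layers, and training would combine a classification loss, a graph reconstruction/prediction loss, and the usual KL regularizer on z.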