In recent years, a number of models that learn the relation between vision and language from large datasets have been released. These models perform a variety of tasks, such as answering questions about images, retrieving the sentences that best correspond to images, and finding the image regions that correspond to phrases. Although some examples exist, the connection between these pre-trained vision-language models and robotics remains weak. If the models are connected directly to robot motions, they lose their versatility because of the robot's embodiment and the difficulty of data collection, and they become inapplicable to a wide range of bodies and situations. Therefore, in this study, we categorize and summarize methods for utilizing pre-trained vision-language models flexibly and easily, in a form the robot can understand, without connecting them directly to robot motions. We discuss how to use these models for robot motion selection and motion planning without re-training them. We consider five types of methods for extracting information understandable to robots, and we show results for state recognition, object recognition, affordance recognition, relation recognition, and anomaly detection based on combinations of these five methods. We expect that this study will add flexibility and ease of use, as well as new applications, to the recognition behavior of existing robots.