Human-assisting systems such as robots need to correctly understand the surrounding situation from observations and determine the support actions required by humans. Language is one of the most important channels for communicating with humans, so robots need the ability to express their understanding and action-planning results verbally. In this study, we propose operative action captioning, a new task that estimates and verbalizes the actions a system should take in a human-assisting domain. We built a system that outputs a verbal description of an operative action that changes the current state into a given target state. Via crowdsourcing, we collected a dataset of daily-life situations in which each sample consists of two observed images, one showing the current state and one showing the state after the actions, paired with a caption describing the actions that transform the current state into the target state. We then built a system that estimates the operative action and expresses it as a caption. Because an operative-action caption is expected to describe state-changing actions, we use scene-graph prediction as an auxiliary task, since the events encoded in the scene graphs correspond to these state changes. Experimental results show that our system successfully described the operative actions that should be performed to move from the current state to the target state, and that the auxiliary scene-graph prediction task improved the quality of the generated captions.
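To make the described architecture concrete, below is a minimal sketch of a model that matches the abstract's setup: two image encoders for the current and target states, a caption decoder conditioned on their fused representation, and auxiliary heads that predict scene-graph relation labels for each state. All module names, dimensions, and design choices (shared encoder weights, GRU decoder, multi-label relation heads) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class OperativeActionCaptioner(nn.Module):
    """Hypothetical sketch of an operative-action captioning model:
    encode two state images, decode an action caption, and predict
    scene-graph relations as an auxiliary task."""

    def __init__(self, vocab_size, hidden=512, num_relations=50):
        super().__init__()
        # Shared CNN backbone for both observations (assumption: tied weights).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, hidden, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Fuse the two state embeddings; their combination carries the state change.
        self.fuse = nn.Linear(2 * hidden, hidden)
        # Caption decoder conditioned on the fused state-change representation.
        self.embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)
        # Auxiliary heads: multi-label scene-graph relation logits per state.
        self.sg_current = nn.Linear(hidden, num_relations)
        self.sg_target = nn.Linear(hidden, num_relations)

    def forward(self, img_current, img_target, caption_in):
        h_cur = self.encoder(img_current)            # (B, hidden)
        h_tgt = self.encoder(img_target)             # (B, hidden)
        fused = torch.tanh(self.fuse(torch.cat([h_cur, h_tgt], dim=-1)))
        # Use the fused representation as the decoder's initial hidden state.
        emb = self.embed(caption_in)                 # (B, T, hidden)
        dec, _ = self.decoder(emb, fused.unsqueeze(0))
        logits = self.out(dec)                       # caption token logits
        # Auxiliary scene-graph predictions for current and target states.
        return logits, self.sg_current(h_cur), self.sg_target(h_tgt)
```

Under this reading, training would combine a cross-entropy loss over caption tokens with an auxiliary loss (for example, binary cross-entropy over relation labels) on the scene-graph heads, so that the shared encoders are pushed to represent exactly the state changes the caption must verbalize.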