Goal-oriented generative script learning aims to generate subsequent steps toward a given goal, an essential capability for assisting robots in performing stereotypical activities of daily life. We show that performance on this task improves when historical states are captured not only through the linguistic instructions given to people but also through the additional information provided by accompanying images. We therefore propose a new task, Multimedia Generative Script Learning, which generates subsequent steps by tracking historical states in both the text and vision modalities, and we present the first benchmark for it, containing 2,338 tasks and 31,496 steps with descriptive images. We aim to generate scripts that are visual-state trackable, inductive for unseen tasks, and diverse in their individual steps. To this end, we encode visual state changes with a multimedia selective encoder, transfer knowledge from previously observed tasks with a retrieval-augmented decoder, and present distinct information at each step by optimizing a diversity-oriented contrastive learning objective. We define metrics to evaluate both generation quality and inductive quality. Experimental results demonstrate that our approach significantly outperforms strong baselines.
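As a rough illustration of the diversity-oriented contrastive objective mentioned above, the sketch below implements a generic InfoNCE-style loss that pulls the representation of the generated step toward the gold next step and pushes it away from distractor steps (e.g., earlier steps in the same task that the model should not repeat). This is a minimal sketch under that assumption, not the paper's exact formulation; the function name, the `temperature` parameter, and the choice of negatives are all illustrative.

```python
import torch
import torch.nn.functional as F

def diversity_contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss over step representations.

    anchor:    (d,)   embedding of the generated next step
    positive:  (d,)   embedding of the ground-truth next step
    negatives: (k, d) embeddings of distractor steps, e.g. earlier
               steps in the same script that the model should not repeat
    """
    anchor = F.normalize(anchor, dim=-1)
    candidates = F.normalize(
        torch.cat([positive.unsqueeze(0), negatives], dim=0), dim=-1
    )
    # Cosine similarity between the anchor and each candidate, scaled
    # by the temperature; the positive sits at index 0.
    logits = candidates @ anchor / temperature          # shape: (1 + k,)
    target = torch.zeros(1, dtype=torch.long)           # positive index
    return F.cross_entropy(logits.unsqueeze(0), target)
```

Minimizing this loss encourages each generated step to carry information distinct from previously seen steps, which is one plausible way to realize the diversity goal stated in the abstract.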