We address the following action-effect prediction task. Given an image depicting an initial state of the world and an action expressed in text, predict an image depicting the state of the world following the action. The prediction should preserve the scene context of the input image. We explore the use of the recently proposed GLIDE model for performing this task. GLIDE is a generative neural network that can synthesize (inpaint) masked areas of an image, conditioned on a short piece of text. Our idea is to mask out a region of the input image where the effect of the action is expected to occur. GLIDE is then used to inpaint the masked region conditioned on the required action. In this way, the resulting image has the same background context as the input image, updated to show the effect of the action. We give qualitative results from experiments using the EPIC dataset of egocentric videos labelled with actions.
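To make the mask-and-inpaint procedure concrete, the following minimal Python sketch outlines the pipeline described above. It is a sketch under stated assumptions, not the released implementation: the file names, mask coordinates, and the example action text ("cut the tomato") are illustrative, and `glide_inpaint` is a hypothetical wrapper standing in for the text-conditional inpainting sampler from the glide-text2im codebase.

```python
import numpy as np
from PIL import Image


def glide_inpaint(image: np.ndarray, mask: np.ndarray, prompt: str) -> np.ndarray:
    """Hypothetical wrapper around GLIDE's text-conditional inpainting model.

    In practice this would tokenize `prompt`, run the diffusion sampler of the
    GLIDE inpainting checkpoint on the masked image, and return the
    synthesized RGB frame. A stub is used here to keep the sketch
    self-contained.
    """
    raise NotImplementedError("plug in the GLIDE inpainting sampler here")


# Load the frame depicting the initial state of the world.
frame = np.asarray(Image.open("initial_state.jpg").convert("RGB"))

# Mask out the region where the effect of the action is expected to occur
# (a hand-chosen rectangle in this sketch; 1 = keep, 0 = to be inpainted).
mask = np.ones(frame.shape[:2], dtype=np.uint8)
mask[120:260, 200:380] = 0

# Inpaint the masked region conditioned on the action text,
# e.g. an EPIC-style narration such as "cut the tomato".
predicted = glide_inpaint(frame, mask, prompt="cut the tomato")

# Outside the mask the original pixels are kept, so the output shares the
# input's scene context and changes only where the action's effect appears.
result = np.where(mask[..., None] == 1, frame, predicted)
Image.fromarray(result.astype(np.uint8)).save("predicted_state.jpg")
```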