Understanding the steps required to perform a task is an important skill for AI systems. Learning these steps from instructional videos involves two subproblems: (i) identifying the temporal boundaries of sequentially occurring segments and (ii) summarizing these steps in natural language. We refer to this task as Procedure Segmentation and Summarization (PSS). In this paper, we take a closer look at PSS and propose three fundamental improvements over current methods. The segmentation task is critical, as generating a correct summary requires each step to be identified first. However, current segmentation metrics often overestimate segmentation quality because they do not incorporate the temporal order of segments. We propose a new segmentation metric based on dynamic programming that takes the order of segments into account. Current PSS methods are typically trained by proposing segments, matching them with the ground truth, and computing a loss. However, much like segmentation metrics, existing matching algorithms do not consider the temporal order of the mapping between candidate segments and the ground truth. We propose a matching algorithm that constrains the temporal order of the segment mapping and is also differentiable. Lastly, we introduce multi-modal feature training for PSS, which further improves segmentation. We evaluate our approach on two instructional video datasets (YouCook2 and Tasty) and improve the state of the art by a margin of $\sim7\%$ and $\sim2.5\%$ for procedure segmentation and summarization, respectively.
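To make the role of temporal order concrete, the following is a minimal sketch of how a dynamic program can restrict the matching between predicted and ground-truth segments to order-preserving pairs. It is an illustration under our own assumptions, not the metric or matching algorithm proposed in the paper: the function names (`tiou`, `unordered_score`, `ordered_score`), the use of temporal IoU as the pairwise score, and the normalization by the number of ground-truth segments are all hypothetical choices made for the example.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds; hypothetical representation


def tiou(a: Segment, b: Segment) -> float:
    """Temporal intersection over union of two segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def unordered_score(pred: List[Segment], gt: List[Segment]) -> float:
    """Order-agnostic baseline: each ground-truth segment takes its best
    overlapping prediction, regardless of where that prediction appears."""
    if not gt or not pred:
        return 0.0
    return sum(max(tiou(p, g) for p in pred) for g in gt) / len(gt)


def ordered_score(pred: List[Segment], gt: List[Segment]) -> float:
    """Order-aware score (illustrative assumption): best total tIoU over
    one-to-one matchings that preserve the sequence order of both lists,
    divided by the number of ground-truth segments."""
    n, m = len(pred), len(gt)
    if m == 0:
        return 0.0
    # best[i][j]: best total tIoU using the first i predictions
    # and the first j ground-truth segments.
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(
                best[i - 1][j],  # leave prediction i unmatched
                best[i][j - 1],  # leave ground-truth segment j unmatched
                best[i - 1][j - 1] + tiou(pred[i - 1], gt[j - 1]),  # match the pair
            )
    return best[n][m] / m


# Two ground-truth steps, and predictions that cover them in the wrong order.
gt_segments = [(0.0, 10.0), (10.0, 20.0)]
pred_segments = [(10.0, 20.0), (0.0, 10.0)]
print(unordered_score(pred_segments, gt_segments))
print(ordered_score(pred_segments, gt_segments))
```

In this toy case the order-agnostic score rates the swapped predictions as perfect, while the order-constrained score can credit only one of the two steps. This is the kind of overestimation the abstract attributes to metrics that ignore temporal order.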