Procedure-Aware Pretraining for Instructional Video Understanding

Our goal is to learn a video representation that is useful for downstream procedure understanding tasks in instructional videos. Because annotations are scarce, a key challenge in procedure understanding is extracting procedural knowledge from unlabeled videos, such as the identity of the task (e.g., 'make latte'), its steps (e.g., 'pour milk'), or the potential next steps given partial progress in its execution. Our main insight is that instructional videos depict sequences of steps that repeat across instances of the same or different tasks, and that this structure can be well represented by a Procedural Knowledge Graph (PKG), where nodes are discrete steps and edges connect steps that occur sequentially in the instructional activities. This graph can then be used to generate pseudo labels to train a video representation that encodes the procedural knowledge in a more accessible form, so that it generalizes to multiple procedure understanding tasks. We build a PKG by combining information from a text-based procedural knowledge database and an unlabeled instructional video corpus, and then use it to generate training pseudo labels with four novel pre-training objectives. We call this PKG-based pre-training procedure and the resulting model Paprika, Procedure-Aware PRe-training for Instructional Knowledge Acquisition. We evaluate Paprika on COIN and CrossTask for procedure understanding tasks such as task recognition, step recognition, and step forecasting. Paprika yields a video representation that improves over the state of the art, with accuracy gains of up to 11.23% across 12 evaluation settings. Implementation is available at https://github.com/salesforce/paprika.
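
To make the graph structure described in the abstract concrete, here is a minimal Python sketch of a toy step graph: nodes are step names, and a directed edge links two steps that occur back to back in some sequence. This is not Paprika's implementation (see the repository linked above for that). The step sequences, the function names `build_pkg` and `next_step_candidates`, and the use of edge counts as weights are all assumptions made for illustration only.

```python
from collections import defaultdict


def build_pkg(step_sequences):
    """Build a toy directed step graph.

    Nodes are step names. The weight of edge (a, b) counts how often
    step b immediately follows step a in the given sequences.
    """
    edges = defaultdict(lambda: defaultdict(int))
    nodes = set()
    for seq in step_sequences:
        nodes.update(seq)
        # Pair each step with its immediate successor.
        for cur, nxt in zip(seq, seq[1:]):
            edges[cur][nxt] += 1
    # Convert nested defaultdicts to plain dicts for safer lookups.
    return nodes, {src: dict(dst) for src, dst in edges.items()}


def next_step_candidates(edges, step):
    """Return successors of `step`, most frequent first, ties broken by name."""
    successors = edges.get(step, {})
    return sorted(successors, key=lambda s: (-successors[s], s))


# Hypothetical step sequences, one per video; not taken from any dataset.
sequences = [
    ["grind beans", "brew espresso", "steam milk", "pour milk"],  # make latte
    ["grind beans", "brew espresso", "add hot water"],            # make americano
    ["boil water", "steep tea", "steam milk", "pour milk"],       # make tea latte
]

nodes, edges = build_pkg(sequences)
candidates = next_step_candidates(edges, "brew espresso")
print(candidates)
```

Because 'steam milk' and 'pour milk' appear in two different tasks, they become shared nodes in the graph, which is the kind of cross-task repetition the abstract points to. Successor lists like `candidates` could serve as a crude stand-in for next-step pseudo labels, although the four pre-training objectives the abstract mentions are not reproduced here.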
