The abundance of instructional videos and their narrations over the Internet offers an exciting avenue for understanding procedural activities. In this work, we propose to learn a video representation that encodes both action steps and their temporal ordering, based on a large-scale dataset of web instructional videos and their narrations, without using human annotations. Our method jointly learns a video representation to encode individual step concepts, and a deep probabilistic model to capture both temporal dependencies and immense individual variations in the step ordering. We empirically demonstrate that learning temporal ordering not only enables new capabilities for procedure reasoning, but also reinforces the recognition of individual steps. Our model significantly advances the state-of-the-art results on step classification (+2.8% / +3.3% on COIN / EPIC-Kitchens) and step forecasting (+7.4% on COIN). Moreover, our model attains promising results in zero-shot inference for step classification and forecasting, as well as in predicting diverse and plausible steps for incomplete procedures. Our code is available at https://github.com/facebookresearch/ProcedureVRL.
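To make the joint objective concrete, below is a minimal, hypothetical PyTorch sketch of the two training signals the abstract describes: matching clip embeddings to step concepts, and an ordering model that predicts the next step from the clips seen so far. The paper's deep probabilistic model is stood in for here by a simple causal transformer, and all names (`ProcedureModel`, `num_steps`, the feature dimensions) are illustrative assumptions, not the released code.

```python
# Hypothetical sketch, not the authors' implementation: one head matches
# each clip to a step concept, another predicts the next step autoregressively.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProcedureModel(nn.Module):
    def __init__(self, dim=256, num_steps=500):  # num_steps is dataset-dependent
        super().__init__()
        # Stand-in video encoder: one embedding per fixed-length clip feature.
        self.video_encoder = nn.Sequential(
            nn.Linear(512, dim), nn.ReLU(), nn.Linear(dim, dim))
        # Learnable step-concept embeddings (in the paper, step concepts are
        # derived from narrations rather than learned from scratch).
        self.step_embed = nn.Embedding(num_steps, dim)
        # Causal transformer as a stand-in for the deep probabilistic
        # model over step ordering.
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.ordering = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_steps)

    def forward(self, clips):  # clips: (B, T, 512) precomputed clip features
        z = self.video_encoder(clips)                 # (B, T, dim)
        # Step-matching logits: similarity of each clip to every step concept.
        step_logits = z @ self.step_embed.weight.t()  # (B, T, num_steps)
        # Causal mask so each position only attends to earlier clips.
        T = z.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.ordering(z, mask=mask)
        next_step_logits = self.head(h)               # next-step distribution
        return step_logits, next_step_logits

model = ProcedureModel()
clips = torch.randn(2, 5, 512)            # 2 videos, 5 clips each
step_logits, next_logits = model(clips)
# Pseudo step labels from the matching head; in training these would be
# supervised by narration-derived step concepts (not shown).
pseudo_labels = step_logits.argmax(-1)
# Ordering loss: predict the (pseudo) step at t+1 from clips up to t.
loss_order = F.cross_entropy(
    next_logits[:, :-1].reshape(-1, next_logits.size(-1)),
    pseudo_labels[:, 1:].reshape(-1))
```

The point of the joint setup is that the ordering loss gives the clip encoder a temporal training signal beyond per-clip matching, which is one way to read the abstract's claim that learning ordering also reinforces recognition of individual steps.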