This paper addresses task recognition and action segmentation in weakly labeled instructional videos, where only the ordered sequence of video-level actions is available during training. We propose a two-stream framework that exploits semantic and temporal hierarchies to recognize top-level tasks in instructional videos. We further present a novel top-down weakly supervised action segmentation approach, in which the predicted task constrains the inference of fine-grained action sequences. Experimental results on the popular Breakfast and Cooking 2 datasets show that our two-stream hierarchical task modeling significantly outperforms existing methods in top-level task recognition across all datasets and metrics. Additionally, using our task recognition framework in the proposed top-down action segmentation approach consistently improves on the state of the art, while also reducing segmentation inference time by 80-90 percent.
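The top-down idea described in the abstract, where the predicted task constrains the inference of the action sequence, can be illustrated with a small sketch. This is not the authors' implementation: it assumes that per-frame action log-scores are already available, that each task comes with a set of candidate ordered action sequences (transcripts), and that segmentation is done by a simple dynamic-programming alignment. The names `align`, `segment_top_down`, and `transcripts_by_task` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Sequence, Tuple

NEG_INF = float("-inf")


def align(frame_scores: Sequence[Sequence[float]],
          transcript: Sequence[int]) -> Tuple[float, List[int]]:
    """Best monotonic alignment of frames to an ordered transcript.

    frame_scores[t][a] is the log-score of action a at frame t.
    Every transcript entry must cover at least one frame.
    Returns (total score, per-frame action labels).
    """
    T, N = len(frame_scores), len(transcript)
    if N == 0 or T < N:
        return NEG_INF, []
    # best[t][n]: best score with frame t assigned to transcript position n
    best = [[NEG_INF] * N for _ in range(T)]
    # back[t][n]: 1 if frame t-1 was at position n-1, 0 if it was at n
    back = [[0] * N for _ in range(T)]
    best[0][0] = frame_scores[0][transcript[0]]
    for t in range(1, T):
        for n in range(min(t + 1, N)):
            stay = best[t - 1][n]
            move = best[t - 1][n - 1] if n > 0 else NEG_INF
            if move > stay:
                best[t][n] = move + frame_scores[t][transcript[n]]
                back[t][n] = 1
            else:
                best[t][n] = stay + frame_scores[t][transcript[n]]
    # Backtrack from the last frame, which must sit on the last action.
    labels = [0] * T
    n = N - 1
    for t in range(T - 1, -1, -1):
        labels[t] = transcript[n]
        n -= back[t][n]
    return best[T - 1][N - 1], labels


def segment_top_down(frame_scores: Sequence[Sequence[float]],
                     transcripts_by_task: Dict[str, List[List[int]]],
                     predicted_task: str) -> List[int]:
    """Search only the transcripts of the predicted task (assumed setup)."""
    best_score, best_labels = NEG_INF, []
    for transcript in transcripts_by_task.get(predicted_task, []):
        score, labels = align(frame_scores, transcript)
        if score > best_score:
            best_score, best_labels = score, labels
    return best_labels


if __name__ == "__main__":
    # Toy input: 4 frames, 3 actions; values are made-up log-scores.
    scores = [
        [0.0, -5.0, -5.0],
        [0.0, -5.0, -5.0],
        [-5.0, 0.0, -5.0],
        [-5.0, -5.0, 0.0],
    ]
    candidates = {"coffee": [[0, 1, 2], [0, 2]], "tea": [[2, 1, 0]]}
    print(segment_top_down(scores, candidates, "coffee"))
```

In this sketch, the alignment step runs only over the candidate sequences of the predicted task rather than over the candidates of every task. Under the stated assumptions, that smaller search space is one plausible way a top-down approach could reduce segmentation inference time.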