In this work, we consider the problem of weakly-supervised multi-step localization in instructional videos. An established approach to this problem is to rely on a given list of steps. In reality, however, there is often more than one way to execute a procedure successfully, by following the same set of steps in slightly varying orders. For successful localization in a given video, recent works therefore require the actual order of procedure steps in the video to be provided by human annotators at both training and test time. Here, instead, we rely only on generic procedural text that is not tied to a specific video. We represent the various ways to complete the procedure by transforming the list of instructions into a procedure flow graph, which captures the partial order of steps. Using flow graphs reduces annotation requirements at both training and test time. To this end, we introduce the new problem of flow graph to video grounding. In this setup, we seek the optimal step ordering consistent with both the procedure flow graph and a given video. To solve this problem, we propose a new algorithm, Graph2Vid, that infers the actual ordering of steps in the video and simultaneously localizes them. To show the advantage of our proposed formulation, we extend the CrossTask dataset with procedure flow graph information. Our experiments show that Graph2Vid is more efficient than the baselines and yields strong step localization results, without the need for step order annotation.
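To make the notion of a procedure flow graph concrete, the following minimal sketch encodes a partial order over steps as a directed acyclic graph and enumerates the step orderings consistent with it. It is an illustration only and is not the Graph2Vid algorithm, which additionally grounds the steps in a video. The toy procedure, the `FLOW_GRAPH` dictionary, and the helper names `is_consistent` and `valid_orderings` are all assumptions introduced here.

```python
from itertools import permutations

# Hypothetical flow graph for a toy "make pancakes" procedure:
# each step maps to the set of steps that must come before it.
FLOW_GRAPH = {
    "add flour": set(),
    "add milk": set(),
    "whisk": {"add flour", "add milk"},
    "fry": {"whisk"},
}


def is_consistent(order, graph):
    """Return True if `order` contains every step exactly once
    and respects every precedence constraint in `graph`."""
    if sorted(order) != sorted(graph):
        return False
    position = {step: i for i, step in enumerate(order)}
    return all(
        position[pred] < position[step]
        for step, preds in graph.items()
        for pred in preds
    )


def valid_orderings(graph):
    """Enumerate all orderings consistent with the graph by brute force.
    This is exponential in the number of steps, so it suits toy graphs only."""
    return [list(o) for o in permutations(graph) if is_consistent(o, graph)]


if __name__ == "__main__":
    for ordering in valid_orderings(FLOW_GRAPH):
        print(" -> ".join(ordering))
```

In this toy graph, "add flour" and "add milk" are unordered with respect to each other, so more than one full ordering is valid. Flow graph to video grounding, as described above, amounts to choosing among such orderings the one that best matches a given video while localizing each step.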