Real-world tasks consist of multiple inter-dependent subtasks (e.g., a dirty pan needs to be washed before it can be used for cooking). In this work, we aim to model the causal dependencies between such subtasks from instructional videos describing the task. This is a challenging problem since complete information about the world is often inaccessible from videos, demanding robust learning mechanisms to infer the causal structure of events. We present Multimodal Subtask Graph Generation (MSG2), an approach that constructs a Subtask Graph defining the dependencies between a task's subtasks from noisy web videos. Graphs generated by our multimodal approach are closer to human-annotated graphs than those produced by prior approaches. On the downstream task of next subtask prediction, MSG2 is 85% and 30% more accurate than recent video transformer models on the ProceL and CrossTask datasets, respectively.