This paper targets the problem of procedural multimodal machine comprehension (M3C). The task requires an AI to comprehend the given steps of a set of multimodal instructions and then answer questions about them. Compared to vanilla machine comprehension tasks, where an AI only has to understand a textual input, procedural M3C is more challenging because the AI must comprehend temporal and causal factors in addition to the multimodal inputs. Recently, Yagcioglu et al. [35] introduced the RecipeQA dataset to evaluate M3C. Our first contribution is the introduction of two new M3C datasets, WoodworkQA and DecorationQA, with 16K and 10K instructional procedures, respectively. We then evaluate M3C using a textual cloze-style question-answering task and highlight an inherent bias in the question-answer generation method of [35] that enables a naive baseline to cheat by learning from the answer choices alone. This naive baseline performs similarly to Impatient Reader [6], a popular question-answering method that uses attention over both the context and the query. We hypothesize that this naturally occurring bias in the dataset affects even the best-performing model. We verify this hypothesis and propose an algorithm that modifies a given dataset to remove the biased elements. Finally, we report the performance of several strong baselines on the debiased dataset. We observe that the performance of all methods falls by 8% to 16% after correcting for the bias. We hope that these datasets and the accompanying analysis will provide valuable benchmarks and encourage further research in this area.
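To make the notion of a baseline that learns "from the answer choices alone" concrete, the following is a minimal sketch and not the baseline evaluated in the paper. It assumes each cloze question is reduced to a list of candidate strings plus the index of the correct one, and it scores candidates with smoothed token log-odds. The function names and the toy examples are hypothetical and chosen only for illustration; the context (the instruction steps) is never read.

```python
import math
from collections import Counter

def train_choice_only(examples):
    """Count tokens in correct vs. incorrect choices. The context is never used."""
    pos, neg = Counter(), Counter()
    for choices, answer in examples:
        for i, choice in enumerate(choices):
            target = pos if i == answer else neg
            target.update(choice.lower().split())
    return pos, neg

def score(choice, pos, neg):
    """Sum of add-one smoothed log-odds that each token belongs to a correct choice."""
    pos_total = sum(pos.values())
    neg_total = sum(neg.values())
    vocab = len(set(pos) | set(neg))
    total = 0.0
    for tok in choice.lower().split():
        p = (pos[tok] + 1) / (pos_total + vocab)
        q = (neg[tok] + 1) / (neg_total + vocab)
        total += math.log(p / q)
    return total

def predict(choices, pos, neg):
    """Return the index of the highest-scoring choice."""
    scores = [score(c, pos, neg) for c in choices]
    return scores.index(max(scores))

def accuracy(examples, pos, neg):
    hits = sum(predict(choices, pos, neg) == answer for choices, answer in examples)
    return hits / len(examples)

# Hypothetical toy data: correct answers come from the woodworking domain,
# while distractors are drawn from unrelated procedures.
TRAIN = [
    (["sand the board", "whisk the eggs", "boil the pasta", "fold the paper"], 0),
    (["knit the scarf", "glue the joint", "chop the onion", "paint the wall"], 1),
    (["bake the bread", "sew the hem", "clamp the board", "wash the car"], 2),
]
HELD_OUT = [
    (["fry the onion", "sand the joint", "knit the hem", "fold the scarf"], 1),
]

pos, neg = train_choice_only(TRAIN)
print(accuracy(HELD_OUT, pos, neg))
```

If a model of this kind scores well above chance on a real dataset, the answer choices themselves leak information about the label, which is the type of bias the abstract describes.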