Modeling expressive cross-modal interactions seems crucial in multimodal tasks, such as visual question answering. However, sometimes high-performing black-box algorithms turn out to be mostly exploiting unimodal signals in the data. We propose a new diagnostic tool, empirical multimodally-additive function projection (EMAP), for isolating whether or not cross-modal interactions improve performance for a given model on a given task. This function projection modifies model predictions so that cross-modal interactions are eliminated, isolating the additive, unimodal structure. For seven image+text classification tasks (on each of which we set new state-of-the-art benchmarks), we find that, in many cases, removing cross-modal interactions results in little to no performance degradation. Surprisingly, this holds even when expressive models, with capacity to consider interactions, otherwise outperform less expressive models; thus, performance improvements, even when present, often cannot be attributed to consideration of cross-modal feature interactions. We hence recommend that researchers in multimodal machine learning report the performance not only of unimodal baselines, but also the EMAP of their best-performing model.
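As a concrete illustration of the projection described above, the sketch below estimates a multimodally-additive projection of a model's predictions from its logits on all text/image pairings of an evaluation set: each matched pair's prediction is replaced by a text-only mean plus an image-only mean minus a grand mean, which averages out cross-modal interaction terms. The array name `pairwise_logits` and the helper `emap` are illustrative assumptions; this is a minimal sketch of the idea, not the paper's reference implementation.

```python
import numpy as np

def emap(pairwise_logits: np.ndarray) -> np.ndarray:
    """Empirical multimodally-additive projection of model predictions.

    pairwise_logits[i, j, :] holds the model's class logits when text input i
    is paired with image input j (shape: [N, N, n_classes], where the matched
    evaluation pairs lie on the diagonal i == j).
    Returns projected logits for the matched pairs, in which cross-modal
    interaction terms have been averaged out, leaving only additive
    (unimodal) structure plus a constant offset.
    """
    text_means = pairwise_logits.mean(axis=1)       # [N, C]: text-only contribution
    image_means = pairwise_logits.mean(axis=0)      # [N, C]: image-only contribution
    grand_mean = pairwise_logits.mean(axis=(0, 1))  # [C]:    global offset
    return text_means + image_means - grand_mean    # [N, C]: additive projection

# Usage sketch: compare the original model's accuracy on the matched pairs
# (the diagonal of pairwise_logits) against the accuracy of its EMAP.
# pairwise_logits = ...  # [N, N, C], computed by running the model on all pairings
# original_preds = pairwise_logits[np.arange(len(pairwise_logits)),
#                                  np.arange(len(pairwise_logits))].argmax(-1)
# emap_preds = emap(pairwise_logits).argmax(-1)
```

If the two accuracies are close, the model's performance on the task is largely attributable to unimodal (additive) structure rather than to cross-modal feature interactions.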