Deep models that are both effective and explainable are desirable in many settings; prior explainable models have been unimodal, offering either image-based visualization of attention weights or text-based generation of post-hoc justifications. We propose a multimodal approach to explanation, and argue that the two modalities provide complementary explanatory strengths. We collect two new datasets to define and evaluate this task, and propose a novel model which can provide joint textual rationale generation and attention visualization. Our datasets define visual and textual justifications of a classification decision for activity recognition tasks (ACT-X) and for visual question answering tasks (VQA-X). We quantitatively show that training with the textual explanations not only yields better textual justification models, but also better localizes the evidence that supports the decision. We also qualitatively show cases where visual explanation is more insightful than textual explanation, and vice versa, supporting our thesis that multimodal explanation models offer significant benefits over unimodal approaches.
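To make the joint textual-and-visual explanation idea concrete, the sketch below is a minimal, hypothetical PyTorch-style model, not the architecture proposed in the paper: it attends over image region features to produce an attention map (the visual explanation), predicts an answer from the attended context, and conditions a rationale decoder on that context and the predicted answer (the textual explanation). All module names, feature dimensions, and the GRU decoder are illustrative assumptions.

```python
# Minimal illustrative sketch of a joint attention + rationale-generation model.
# Not the paper's architecture; dimensions and modules are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalExplainer(nn.Module):
    def __init__(self, vocab_size, num_answers, feat_dim=2048, hid_dim=512, emb_dim=300):
        super().__init__()
        self.q_embed = nn.Embedding(vocab_size, emb_dim)
        self.q_rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.att_fc = nn.Linear(feat_dim + hid_dim, 1)          # attention score per image region
        self.ans_fc = nn.Linear(feat_dim + hid_dim, num_answers) # answer classifier
        self.rat_embed = nn.Embedding(vocab_size, emb_dim)
        self.rat_rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.rat_init = nn.Linear(feat_dim + hid_dim + num_answers, hid_dim)
        self.rat_out = nn.Linear(hid_dim, vocab_size)

    def forward(self, img_feats, question, rationale_in):
        # img_feats: (B, R, feat_dim) region features; question: (B, Tq) token ids
        # rationale_in: (B, Tr) rationale tokens for teacher-forced decoding
        _, q_h = self.q_rnn(self.q_embed(question))              # (1, B, hid_dim)
        q_h = q_h.squeeze(0)                                     # (B, hid_dim)
        q_exp = q_h.unsqueeze(1).expand(-1, img_feats.size(1), -1)
        att_logits = self.att_fc(torch.cat([img_feats, q_exp], dim=-1)).squeeze(-1)
        att = F.softmax(att_logits, dim=-1)                      # visual explanation: (B, R)
        ctx = (att.unsqueeze(-1) * img_feats).sum(dim=1)         # attended image context
        ans_logits = self.ans_fc(torch.cat([ctx, q_h], dim=-1))  # answer prediction
        ans_prob = F.softmax(ans_logits, dim=-1)
        # Condition the rationale decoder on the attended context and the predicted answer.
        h0 = torch.tanh(self.rat_init(torch.cat([ctx, q_h, ans_prob], dim=-1))).unsqueeze(0)
        dec_out, _ = self.rat_rnn(self.rat_embed(rationale_in), h0)
        rat_logits = self.rat_out(dec_out)                       # (B, Tr, vocab_size)
        return ans_logits, att, rat_logits
```

In a setup like this, training would sum cross-entropy losses on the answer and rationale outputs, and the attention map is produced as a by-product that can be visualized directly; this is one plausible way to realize "joint textual rationale generation and attention visualization", not a description of the authors' method.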