Multi-modal and multi-hop question answering aims to answer a question based on multiple evidence sources drawn from different modalities. Previous methods retrieve each piece of evidence independently and feed the retrieved evidence to a language model to generate the answer. However, these methods fail to build connections among candidates and thus cannot model their interdependence during retrieval. Moreover, without alignment between modalities, the reasoning process over multi-modal candidates can be unbalanced. To address these limitations, we propose a Structured Knowledge and Unified Retrieval-Generation based method (SKURG). We align sources from different modalities via shared entities and map them into a shared semantic space using structured knowledge. Then, we employ a unified retrieval-generation decoder that integrates intermediate retrieval results for answer generation and adaptively determines the number of retrieval steps. We conduct experiments on two multi-modal and multi-hop datasets, WebQA and MultimodalQA. The results demonstrate that SKURG achieves state-of-the-art performance on both evidence retrieval and answer generation.
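To make the unified retrieval-generation loop concrete, here is a minimal sketch of the idea in the abstract: at each hop the model scores the remaining candidates against the question plus the evidence gathered so far, and a special stop action lets it decide adaptively how many retrieval steps to take before generating the answer. All names below (score, generate, STOP, retrieve_then_generate) are illustrative placeholders, not SKURG's actual interfaces.

```python
from typing import Callable, List

# Hypothetical stop token: treating "stop retrieval" as just another candidate
# is what lets the model choose the number of hops adaptively.
STOP = "<stop_retrieval>"

def retrieve_then_generate(
    question: str,
    candidates: List[str],               # evidence snippets (e.g., text passages, image captions)
    score: Callable[[str, str], float],  # relevance of a candidate to the current context
    generate: Callable[[str], str],      # answer generator over the fused evidence
    max_hops: int = 4,
) -> str:
    """Interleave evidence retrieval with answer generation."""
    context = question
    pool = list(candidates) + [STOP]
    for _ in range(max_hops):
        # Score every remaining candidate against the context so far.
        best = max(pool, key=lambda c: score(context, c))
        if best == STOP:                 # model decides it has enough evidence
            break
        pool.remove(best)
        context += " " + best            # condition the next hop on the new evidence
    return generate(context)

# Toy usage with a trivial lexical-overlap scorer standing in for the learned model.
if __name__ == "__main__":
    score = lambda ctx, c: float(len(set(ctx.split()) & set(c.split())))
    generate = lambda ctx: f"answer derived from: {ctx}"
    print(retrieve_then_generate(
        "Where was the painter of the Mona Lisa born?",
        ["The Mona Lisa was painted by Leonardo da Vinci.",
         "Leonardo da Vinci was born in Vinci, Italy."],
        score, generate))
```

Because each hop is conditioned on the evidence already retrieved, later retrievals can depend on earlier ones, which is exactly the inter-candidate dependence the abstract argues separate-retrieval pipelines cannot model.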