Multi-modal named entity recognition (NER) and relation extraction (RE) aim to leverage relevant image information to improve the performance of NER and RE. Most existing efforts have focused on directly extracting potentially useful information from images (such as pixel-level features, identified objects, and associated captions). However, such extraction processes may not be knowledge-aware, resulting in information of limited relevance. In this paper, we propose a novel Multi-modal Retrieval-based framework (MoRe). MoRe contains a text retrieval module and an image-based retrieval module, which retrieve knowledge related to the input text and image from a knowledge corpus, respectively. The retrieval results are then fed into a textual model and a visual model, respectively, for prediction. Finally, a Mixture of Experts (MoE) module combines the predictions from the two models to make the final decision. Our experiments show that both our textual model and visual model achieve state-of-the-art performance on four multi-modal NER datasets and one multi-modal RE dataset. With MoE, model performance can be further improved, and our analysis demonstrates the benefits of integrating both textual and visual cues for such tasks.
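To make the combination step concrete, below is a minimal PyTorch sketch of how an MoE gate might mix the per-token label distributions produced by the textual and visual retrieval-augmented models. The module, tensor shapes, and names here are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class MoEGate(nn.Module):
    """Hypothetical sketch: a gating network that mixes label
    distributions from two experts (a textual model and a visual
    model). Not the authors' implementation."""

    def __init__(self, hidden_dim: int, num_experts: int = 2):
        super().__init__()
        # Gating network: maps a pooled input representation
        # to mixing weights over the experts.
        self.gate = nn.Linear(hidden_dim, num_experts)

    def forward(self, pooled_repr: torch.Tensor,
                expert_probs: torch.Tensor) -> torch.Tensor:
        # pooled_repr:  (batch, hidden_dim) sentence representation
        # expert_probs: (batch, num_experts, seq_len, num_labels)
        #               per-token label distributions from each expert
        weights = torch.softmax(self.gate(pooled_repr), dim=-1)
        # Weighted sum of the experts' distributions gives the
        # final per-token prediction: (batch, seq_len, num_labels).
        return torch.einsum('be,besl->bsl', weights, expert_probs)
```

Under this sketch, the gate learns input-dependent weights, so examples where the retrieved image knowledge is more informative can lean on the visual expert, and vice versa.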