To advance research on multimodal knowledge bases and multimodal information processing, we propose a new task, multimodal entity tagging (MET), defined over a multimodal knowledge base (MKB). We also build a dataset for the task from an existing MKB. An MKB contains entities together with their associated texts and images. In MET, given a text-image pair, the goal is to use the information in the MKB to automatically identify the entity that the pair refers to. We address the task within the information retrieval paradigm and implement several baselines using state-of-the-art methods from NLP and CV. We conduct extensive experiments and analyze the results. They show that the task is challenging, but that current technologies can achieve relatively high performance. We will release the dataset, code, and models for future research.
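As a rough illustration of the retrieval view of MET described above, the following Python sketch ranks MKB entities against a query text-image pair by similarity. It is a minimal sketch under stated assumptions, not the baselines from this work: the toy vectors stand in for the outputs of text and image encoders, and the names `fuse`, `rank_entities`, `TOY_MKB`, and `QUERY`, as well as the choice of concatenation and cosine similarity, are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(text_vec, image_vec):
    """Assumed fusion: concatenate the normalized text and image vectors."""
    return l2_normalize(text_vec) + l2_normalize(image_vec)


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def rank_entities(query, mkb, top_k=1):
    """Return the names of the top_k MKB entities most similar to the query.

    query: (text_vec, image_vec) for the input text-image pair.
    mkb:   dict mapping entity name -> (text_vec, image_vec).
    """
    q = fuse(*query)
    scored = [(cosine(q, fuse(t, i)), name) for name, (t, i) in mkb.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:top_k]]


# Hypothetical toy MKB: in practice these vectors would come from
# pretrained text and image encoders, which are not specified here.
TOY_MKB = {
    "entity_a": ([1.0, 0.0], [1.0, 0.0]),
    "entity_b": ([0.0, 1.0], [0.0, 1.0]),
    "entity_c": ([1.0, 1.0], [0.0, 1.0]),
}
QUERY = ([0.9, 0.1], [0.8, 0.2])

if __name__ == "__main__":
    print(rank_entities(QUERY, TOY_MKB, top_k=2))
```

Here the tagged entity is simply the top-ranked one; a real system would replace the toy vectors with learned representations and might fuse the two modalities differently.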