Recently, multimodal named entity recognition (MNER) has utilized images to improve the accuracy of NER in tweets. However, most multimodal methods use attention mechanisms to extract visual clues regardless of whether the text and image are relevant. In practice, irrelevant text-image pairs account for a large proportion of tweets, and visual clues unrelated to the text can have uncertain or even negative effects on multimodal model learning. In this paper, we introduce a method that propagates text-image relations into a multimodal BERT model. We integrate soft or hard gates to select visual clues and propose a multitask algorithm for training on the MNER datasets. In the experiments, we analyze in depth how visual attention changes before and after text-image relation propagation is applied. Our model achieves state-of-the-art performance on the MNER datasets.
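To make the gating idea concrete, the following is a minimal illustrative sketch, not the paper's implementation: a predicted text-image relevance score scales visual features (soft gate) or zeroes them out entirely (hard gate) before they contribute to the NER model. The class name `RelevanceGate`, the use of a fused representation as the scorer input, and the 0.5 threshold are all assumptions for illustration.

```python
import torch
import torch.nn as nn

class RelevanceGate(nn.Module):
    """Hypothetical soft/hard gate over visual features.

    A relevance score r in [0, 1] is predicted from a fused text-image
    representation (e.g., a [CLS]-style embedding from a multimodal BERT).
    Visual features are scaled by r (soft gate) or by the indicator
    1[r > 0.5] (hard gate) before being attended to for NER.
    """

    def __init__(self, hidden_dim: int, hard: bool = False):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, 1)  # predicts text-image relevance
        self.hard = hard

    def forward(self, fused_repr: torch.Tensor,
                visual_feats: torch.Tensor) -> torch.Tensor:
        # fused_repr: (batch, hidden_dim)
        # visual_feats: (batch, num_regions, hidden_dim)
        r = torch.sigmoid(self.scorer(fused_repr))  # (batch, 1), soft relevance
        if self.hard:
            r = (r > 0.5).float()                   # binary on/off gate
        # Broadcast the scalar gate over all visual regions.
        return visual_feats * r.unsqueeze(-1)
```

Under this sketch, a hard gate discards all visual clues for text-image pairs judged irrelevant, while a soft gate down-weights them continuously; the relevance scorer itself could be trained via the multitask objective the abstract mentions, e.g., with text-image relation labels as an auxiliary task.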