Multi-modal named entity recognition (MNER) aims at identifying entity spans and recognizing their categories in social media posts with the aid of images. However, in dominant MNER approaches, the interaction of different modalities is usually carried out through the alternation of self-attention and cross-attention, or through over-reliance on a gating mechanism, which results in imprecise and biased correspondence between fine-grained semantic units of text and image. To address this issue, we propose a Flat Multi-modal Interaction Transformer (FMIT) for MNER. Specifically, we first utilize noun phrases in sentences and general domain words to obtain visual cues. Then, we transform the fine-grained semantic representations of vision and text into a unified lattice structure and design a novel relative position encoding to match different modalities in the Transformer. Meanwhile, we propose to leverage entity boundary detection as an auxiliary task to alleviate visual bias. Experiments show that our method achieves new state-of-the-art performance on two benchmark datasets.
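The unified lattice and relative position encoding can be illustrated with a minimal sketch. Here each lattice element (a text token or a visual cue) is treated as a span with head and tail positions, and relative position is expressed through four head/tail distances, in the style of flat-lattice Transformers; the `Span` helper and the four-distance scheme are illustrative assumptions, not the paper's exact formulation.

```python
from dataclasses import dataclass

@dataclass
class Span:
    """A lattice element: a text token or visual cue occupying
    positions head..tail in the flattened sequence."""
    head: int  # start position
    tail: int  # end position (inclusive)

def relative_distances(a: Span, b: Span):
    """Four relative distances between two lattice spans, used to
    build a relative position encoding for attention between them."""
    return (a.head - b.head, a.head - b.tail,
            a.tail - b.head, a.tail - b.tail)

# Example: a single text token overlapping a wider visual-cue span.
text_tok = Span(head=2, tail=2)
visual_cue = Span(head=1, tail=3)
print(relative_distances(text_tok, visual_cue))  # (1, -1, 1, -1)
```

In a full model, these four distances would index learnable (or sinusoidal) embeddings that are combined into the attention score, letting text tokens and visual cues attend to each other position-awarely within one flat sequence.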