Named Entity Recognition (NER) on social media refers to discovering and classifying entities in unstructured, free-form content, and it plays an important role in various applications such as intention understanding and user recommendation. As social media posts tend to be multimodal, Multimodal Named Entity Recognition (MNER) over text and its accompanying image is attracting more and more attention, since some textual components can only be understood in combination with visual information. However, existing approaches have two drawbacks: 1) The meanings of the text and its accompanying image do not always match, so the textual information still plays the major role. Yet social media posts are usually shorter and more informal than other text, which easily causes incomplete semantic descriptions and the data sparsity problem. 2) Although visual representations of whole images or objects are already used, existing methods ignore either the fine-grained semantic correspondence between objects in images and words in text, or the objective fact that some images contain misleading objects or no objects at all. In this work, we address these two problems by introducing multi-granularity cross-modality representation learning. To resolve the first problem, we enhance the representation of each word in the text through semantic augmentation. As for the second issue, we perform cross-modality semantic interaction between text and vision at different visual granularities to obtain the most effective multimodal guidance representation for every word. Experiments show that our proposed approach achieves SOTA or near-SOTA performance on two benchmark Twitter datasets. The code, data and the best-performing models are available at https://github.com/LiuPeiP-CS/IIE4MNER
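The cross-modality interaction described above can be illustrated with a minimal NumPy sketch: each word attends over a multi-granularity set of visual features (a whole-image feature plus detected object features), and a per-word gate suppresses the visual guidance when the image is misleading or uninformative. All shapes, the attention form, and the gating function here are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(text, visual):
    """text: (T, d) word features; visual: (V, d) visual features.
    Returns a (T, d) visually guided representation per word."""
    scores = text @ visual.T / np.sqrt(text.shape[1])
    attn = softmax(scores, axis=-1)  # each word attends over visual set
    return attn @ visual

def gated_fuse(text, guidance):
    # sigmoid gate per word: low gate ~ ignore misleading/absent visuals
    gate = 1.0 / (1.0 + np.exp(-np.sum(text * guidance, axis=1, keepdims=True)))
    return text + gate * guidance

rng = np.random.default_rng(0)
words = rng.normal(size=(5, 16))            # 5 word representations
image_global = rng.normal(size=(1, 16))     # coarse: whole-image feature
objects = rng.normal(size=(3, 16))          # fine: object-region features
visual = np.vstack([image_global, objects]) # multi-granularity visual set

fused = gated_fuse(words, cross_modal_attention(words, visual))
print(fused.shape)  # (5, 16)
```

In a real MNER model the word features would come from a pretrained text encoder and the visual features from an image backbone and object detector; the fused representations would then feed a sequence-labeling layer.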