We introduce a new task, Multimodal Named Entity Recognition (MNER), for noisy user-generated data such as tweets or Snapchat captions, which comprise short text with accompanying images. These social media posts often exhibit inconsistent or incomplete syntax and lexical notation, with very limited surrounding textual context, posing significant challenges for NER. To this end, we create a new dataset for MNER called SnapCaptions (Snapchat image-caption pairs submitted to public and crowd-sourced stories, with fully annotated named entities). We then build upon state-of-the-art Bi-LSTM word/character-based NER models with 1) a deep image network that incorporates relevant visual context to augment textual information, and 2) a generic modality-attention module that learns to attenuate irrelevant modalities while amplifying the most informative ones to extract contexts from, adaptively for each sample and token. The proposed MNER model with modality attention significantly outperforms state-of-the-art text-only NER models by successfully leveraging the provided visual contexts, opening up potential applications of MNER on a myriad of social media platforms.
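To make the modality-attention idea concrete, the following is a minimal PyTorch sketch, not the authors' implementation: all class names, dimensions, and the choice of three modalities (word, character, visual) are illustrative assumptions. For each token, it scores each modality's representation, softmax-normalizes the scores into per-token attention weights, and fuses the modalities into a single context vector that a downstream Bi-LSTM tagger could consume.

```python
import torch
import torch.nn as nn

class ModalityAttention(nn.Module):
    """Hypothetical sketch of a generic modality-attention module:
    for each token, compute a softmax over the available modalities
    (e.g., word, character, visual) and fuse them into one vector."""

    def __init__(self, dim, num_modalities=3):
        super().__init__()
        # One scoring head per modality, applied to each token's features.
        self.score = nn.ModuleList(
            nn.Linear(dim, 1) for _ in range(num_modalities)
        )

    def forward(self, modalities):
        # modalities: list of (batch, seq_len, dim) tensors, one per modality
        # (the visual feature would be tiled across the token dimension).
        scores = torch.cat(
            [self.score[m](x) for m, x in enumerate(modalities)], dim=-1
        )                                          # (batch, seq_len, M)
        alpha = torch.softmax(scores, dim=-1)      # attention over modalities
        stacked = torch.stack(modalities, dim=-1)  # (batch, seq_len, dim, M)
        # Attenuate irrelevant modalities and amplify informative ones,
        # with weights computed per sample and per token.
        fused = (stacked * alpha.unsqueeze(2)).sum(dim=-1)
        return fused                               # (batch, seq_len, dim)

# Illustrative usage with random tensors standing in for real encoders:
att = ModalityAttention(dim=256)
w = torch.randn(2, 10, 256)   # word embeddings
c = torch.randn(2, 10, 256)   # char-level encodings
v = torch.randn(2, 10, 256)   # visual features tiled per token
fused = att([w, c, v])        # would feed into the Bi-LSTM NER tagger
```

The per-token softmax is the key design point this sketch assumes: because the weights are recomputed for every token, the model can lean on the image for visually grounded tokens while suppressing it elsewhere.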