Multimodal named entity recognition (MNER) requires bridging the gap between language understanding and visual context. While many multimodal neural techniques have been proposed to incorporate images into the MNER task, the models' ability to leverage multimodal interactions remains poorly understood. In this work, we conduct in-depth analyses of existing multimodal fusion techniques from different perspectives and describe the scenarios in which adding information from the image does not always boost performance. We also study the use of captions as a way to enrich the context for MNER. Experiments on three datasets from popular social platforms expose the bottlenecks of existing multimodal models and the situations in which using captions is beneficial.