Recently, Multi-modal Named Entity Recognition (MNER) has attracted considerable attention. Most existing work utilizes image information through region-level visual representations obtained from a pretrained object detector and relies on an attention mechanism to model the interactions between image and text representations. However, such interactions are difficult to model, as image and text representations are trained separately on data from their respective modalities and are not aligned in the same space. Since text representations play the most important role in MNER, in this paper we propose {\bf I}mage-{\bf t}ext {\bf A}lignments (ITA) to align image features into the textual space, so that the attention mechanism in transformer-based pretrained textual embeddings can be better utilized. ITA first aligns the image into region-level object tags, an image-level caption, and optical characters as visual contexts, concatenates them with the input text as a new cross-modal input, and then feeds it into a pretrained textual embedding model. This makes it easier for the attention module of the pretrained textual embedding model to model the interaction between the two modalities, since both are represented in the textual space. ITA further aligns the output distributions predicted from the cross-modal and text-only input views, so that the MNER model is more practical for text-only inputs and more robust to noise from images. In our experiments, we show that ITA models achieve state-of-the-art accuracy on multi-modal Named Entity Recognition datasets, even without image information.
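To make the first alignment step concrete, the following is a minimal sketch of how the textual visual contexts can be concatenated with the input sentence before encoding. The tokenizer choice, the helper name, and the example inputs are illustrative assumptions, not the paper's exact pipeline.

\begin{verbatim}
# Minimal sketch of ITA-style cross-modal input construction.
# Assumptions: a HuggingFace tokenizer is available; the model name,
# helper name, and example inputs are illustrative only.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")

def build_cross_modal_input(text, object_tags, caption, ocr_text):
    """Concatenate object tags, caption, and OCR text (the visual
    contexts) with the input sentence, so a pretrained textual
    encoder can attend over both modalities in one textual space."""
    visual_context = " ".join(object_tags + [caption, ocr_text])
    # Encoded as a text pair: <s> text </s></s> visual context </s>
    return tokenizer(text, visual_context,
                     truncation=True, return_tensors="pt")

encoding = build_cross_modal_input(
    text="Kevin Durant enters Oracle Arena",
    object_tags=["person", "basketball", "stadium"],
    caption="a basketball player walking into an arena",
    ocr_text="ORACLE")
\end{verbatim}

Representing the image as text in this way requires no new cross-modal attention layers: the pretrained self-attention already operates over both segments of the pair.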
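For the second alignment step, the abstract does not pin down a formula; one common way to instantiate ``aligning the output distributions'' of the two views is a KL-divergence consistency term added to the NER objective. The weight $\lambda$ and the direction of the KL term below are assumptions for illustration:
\[
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{NER}} \;+\; \lambda \, D_{\mathrm{KL}}\bigl(\, p(y \mid x, v) \,\big\Vert\, p(y \mid x) \,\bigr),
\]
where $x$ is the input text, $v$ denotes the textual visual contexts, and $p(y \mid \cdot)$ is the label distribution the model predicts from each view. Such a term encourages the text-only view to stay consistent with the cross-modal view, which is what makes the model practical on text-only inputs and robust to noisy images.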