Representation learning is the foundation of natural language processing (NLP). This work presents new methods that employ visual information as auxiliary signals for general NLP tasks. For each sentence, we first retrieve a flexible number of images, either from a lightweight topic-image lookup table extracted from existing sentence-image pairs or from a shared cross-modal embedding space pre-trained on off-the-shelf text-image pairs. The text and images are then encoded by a Transformer encoder and a convolutional neural network, respectively, and the two sequences of representations are fused by an attention layer that models the interaction between the two modalities. In this study, the retrieval process is controllable and flexible. The universal visual representation overcomes the scarcity of large-scale bilingual sentence-image pairs, so our method can be easily applied to text-only tasks without manually annotated multimodal parallel corpora. We apply the proposed method to a wide range of natural language generation and understanding tasks, including neural machine translation, natural language inference, and semantic similarity. Experimental results show that our method is generally effective for different tasks and languages. Analysis indicates that the visual signals enrich textual representations of content words, provide fine-grained grounding information about the relationship between concepts and events, and potentially aid disambiguation.
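The fusion step described above (Transformer-encoded text, CNN-encoded retrieved images, joined by an attention layer) can be sketched in PyTorch as follows. This is a minimal illustration under assumed dimensions and module choices, not the paper's released implementation; the class name `VisualFusion`, the gating linear layer, and the toy CNN backbone are hypothetical stand-ins.

```python
# A minimal sketch of attention-based text-image fusion (assumptions noted above).
import torch
import torch.nn as nn


class VisualFusion(nn.Module):
    """Fuse a Transformer-encoded sentence with CNN-encoded retrieved images
    via a cross-attention layer (text tokens attend to image features)."""

    def __init__(self, d_model=512, n_heads=8, n_layers=2):
        super().__init__()
        # Text encoder: a small Transformer encoder stack.
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        # Image encoder: a toy CNN standing in for a pretrained backbone
        # whose pooled features are projected to d_model.
        self.image_cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, d_model),
        )
        # Attention layer fusing the two modalities: text queries, image keys/values.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, text_emb, images):
        # text_emb: (batch, seq_len, d_model) token embeddings
        # images:   (batch, n_images, 3, H, W) retrieved images per sentence
        b, n, c, h, w = images.shape
        text_h = self.text_encoder(text_emb)                      # (b, seq, d)
        img_h = self.image_cnn(images.view(b * n, c, h, w))       # (b*n, d)
        img_h = img_h.view(b, n, -1)                              # (b, n, d)
        attended, _ = self.cross_attn(text_h, img_h, img_h)       # (b, seq, d)
        fused = self.gate(torch.cat([text_h, attended], dim=-1))  # (b, seq, d)
        return fused


if __name__ == "__main__":
    model = VisualFusion()
    tokens = torch.randn(2, 10, 512)       # two sentences, 10 tokens each
    images = torch.randn(2, 5, 3, 64, 64)  # five retrieved images per sentence
    print(model(tokens, images).shape)     # torch.Size([2, 10, 512])
```

Because the image features enter only through the cross-attention layer, the number of retrieved images per sentence can vary, which is consistent with the "flexible number of images" retrieval described above.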