Visual grounding of language aims at enriching textual representations of language with multiple sources of visual knowledge, such as images and videos. Although visual grounding is an area of intense research, its inter-lingual aspects have received little attention. The present study investigates the inter-lingual visual grounding of word embeddings. We propose an implicit alignment technique between the two spaces of vision and language, in which inter-lingual textual information interacts in order to enrich pre-trained textual word embeddings. Our experiments focus on three languages: English, Arabic, and German. We obtain visually grounded vector representations for these languages and study whether visual grounding on one or multiple languages improves the performance of embeddings on word similarity and categorization benchmarks. Our experiments suggest that inter-lingual knowledge improves the performance of grounded embeddings in similar languages such as German and English. However, inter-lingual grounding of German or English with Arabic led to a slight degradation in performance on word similarity benchmarks. On categorization benchmarks, by contrast, we observed the opposite trend: grounding with Arabic yielded the largest improvement for English. We lay out several reasons for these findings in the discussion section. We hope that our experiments provide a baseline for further research on inter-lingual visual grounding.