Multi-modal word semantics aims to enhance embeddings with perceptual input, on the assumption that human meaning representation is grounded in sensory experience. Most research focuses on evaluation involving direct visual input; however, visual grounding can also contribute to purely linguistic applications. A further motivation for this paper is the growing need for more interpretable models and for evaluating model efficiency in terms of size and performance. This work explores the impact of visual information on semantics when the evaluation involves no direct visual input, specifically semantic similarity and relatedness. We investigate a new type of embedding situated between the linguistic and visual modalities, based on the structured annotations of Visual Genome. We compare uni- and multi-modal models, including structured, linguistic, and image-based representations. We measure the efficiency of each model with respect to data and model size, modality/data distribution, and information gain. The analysis also includes an interpretation of embedding structures. We find that the new embeddings convey information complementary to text-based embeddings. They achieve comparable performance in an economical way, using orders of magnitude fewer resources than visual models.