Visual Semantic Embedding (VSE) models, which map images into a rich semantic embedding space, have been a milestone in object recognition and zero-shot learning. Current approaches to VSE rely heavily on static word embedding techniques. In this work, we propose a Visual Semantic Embedding Probe (VSEP) designed to probe the semantic information of contextualized word embeddings in visual semantic understanding tasks. We show that the knowledge encoded in transformer language models can be exploited for tasks requiring visual semantic understanding. The VSEP with contextual representations can distinguish word-level object representations in complicated scenes as a compositional zero-shot learner. We further introduce a zero-shot setting with VSEPs to evaluate a model's ability to associate a novel word with a novel visual category. We find that contextual representations in language models outperform static word embeddings when the compositional chain of objects is short. We also observe that current visual semantic embedding models lack a mutual exclusivity bias, which limits their performance.
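To make the probing setup concrete, the following is a minimal sketch of a visual semantic embedding probe, not the authors' implementation: all names (VSEProbe, img_proj, word_proj) and dimensions are illustrative assumptions. It projects frozen image features and frozen word embeddings (contextual or static) into a shared space and matches them by cosine similarity, which is the standard recipe for this family of models.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VSEProbe(nn.Module):
    """Hypothetical VSE probe: two trainable linear heads project
    frozen image features and frozen word embeddings into a joint
    space; matching is done by cosine similarity."""
    def __init__(self, img_dim=2048, word_dim=768, joint_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, joint_dim)
        self.word_proj = nn.Linear(word_dim, joint_dim)

    def forward(self, img_feats, word_embs):
        # L2-normalize so inner products are cosine similarities
        v = F.normalize(self.img_proj(img_feats), dim=-1)
        w = F.normalize(self.word_proj(word_embs), dim=-1)
        return v @ w.t()  # similarity matrix: images x words

# Zero-shot usage: score image features (e.g., from a frozen CNN)
# against candidate word embeddings (e.g., contextual embeddings
# from a frozen transformer LM) and pick the best-matching word.
probe = VSEProbe()
img_feats = torch.randn(4, 2048)   # placeholder image features
word_embs = torch.randn(10, 768)   # placeholder word embeddings
scores = probe(img_feats, word_embs)
pred = scores.argmax(dim=-1)       # best-matching word per image
```

Because only the two projection heads are trained, any difference in zero-shot performance can be attributed to the frozen embeddings being probed, which is what allows contextual and static word representations to be compared on equal footing.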