In natural language processing, most models learn semantic representations from text alone. The learned representations encode distributional semantics but fail to connect to any knowledge about the physical world. In contrast, humans learn language by grounding concepts in perception and action, and the brain encodes grounded semantics for cognition. Inspired by this notion and by recent work in vision-language learning, we design a two-stream model for grounding language learning in vision. The model includes a VGG-based visual stream and a BERT-based language stream, which merge into a joint representational space. Through cross-modal contrastive learning on the MS COCO dataset, the model first learns to align visual and language representations. Using the Visual Genome dataset, the model further learns to retrieve visual objects with language queries through a cross-modal attention module and to infer the visual relations between the retrieved objects through a bilinear operator. After training, the language stream of this model is a stand-alone language model capable of embedding concepts in a visually grounded semantic space. The principal dimensions of this semantic space are explainable with human intuition and neurobiological knowledge. Word embeddings in this space are predictive of human-defined norms of semantic features and are segregated into perceptually distinctive clusters. Furthermore, the visually grounded language model enables compositional language understanding based on visual knowledge, as well as multimodal image search with queries based on images, texts, or their combinations.
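As a rough illustration of two ingredients named in the abstract, the sketch below shows one common form of a cross-modal contrastive objective and a bilinear relation scorer. It is a minimal sketch under stated assumptions, not the model described here: the symmetric InfoNCE-style loss, the `temperature` value, the embedding shapes, and the function names `contrastive_loss` and `bilinear_relation_scores` are all hypothetical choices made for illustration, since the abstract does not specify the exact loss or operator.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def contrastive_loss(image_emb, text_emb, temperature=0.1):
    """Symmetric InfoNCE-style loss (an assumed form, not the paper's).

    Row i of image_emb and row i of text_emb are a matched pair; every
    other row in the batch serves as a negative example.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature  # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)


def bilinear_relation_scores(subj, obj, weights):
    """Score each candidate relation r as subj^T W_r obj.

    subj, obj: (d,) object embeddings; weights: (num_relations, d, d).
    """
    return np.einsum("i,rij,j->r", subj, weights, obj)
```

With matched image and text embeddings the loss should be lower than when the pairing is shuffled, which is the behaviour the alignment stage relies on.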