Recent computational models of the acquisition of spoken language via grounding in perception exploit associations between the spoken and visual modalities and learn to represent speech and visual data in a joint vector space. A major unresolved issue from the point of view of ecological validity is the training data, which typically consists of images or videos paired with spoken descriptions of what is depicted. Such a setup guarantees an unrealistically strong correlation between speech and the visual data. In the real world the coupling between the linguistic and the visual modality is loose, and often confounded by correlations with non-semantic aspects of the speech signal. Here we address this shortcoming by using a dataset based on the children's cartoon Peppa Pig. We train a simple bi-modal architecture on the portion of the data consisting of dialog between characters, and evaluate on segments containing descriptive narrations. Despite the weak and confounded signal in this training data, our model succeeds at learning aspects of the visual semantics of spoken language.
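To make the joint vector space concrete, the sketch below shows one way a simple bi-modal architecture of this kind could be set up: a speech encoder and a video encoder projecting into a shared embedding space, trained with a symmetric contrastive (InfoNCE) objective on paired clips. This is a minimal illustration under assumed design choices (GRU speech encoder over acoustic frames, linear projection of pooled video features, the specific loss and dimensions), not a description of the paper's actual implementation.

```python
# Minimal sketch of a bi-modal speech-video embedding model (assumptions, not
# the authors' implementation): both modalities are mapped into a joint space
# and matching clip pairs are pulled together by a contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BiModalEncoder(nn.Module):
    def __init__(self, speech_dim=39, video_dim=512, joint_dim=256):
        super().__init__()
        # Speech branch: a GRU over acoustic feature frames (e.g. MFCCs).
        self.speech_rnn = nn.GRU(speech_dim, joint_dim, batch_first=True)
        # Video branch: a linear projection of precomputed clip features.
        self.video_proj = nn.Linear(video_dim, joint_dim)

    def encode_speech(self, speech):              # speech: (B, T, speech_dim)
        _, h = self.speech_rnn(speech)            # h: (1, B, joint_dim)
        return F.normalize(h.squeeze(0), dim=-1)

    def encode_video(self, video):                # video: (B, video_dim)
        return F.normalize(self.video_proj(video), dim=-1)


def contrastive_loss(speech_emb, video_emb, temperature=0.07):
    """Symmetric InfoNCE: matched speech/video pairs score higher than mismatched ones."""
    logits = speech_emb @ video_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


# Example training step on a batch of paired dialog clips (random placeholder data).
model = BiModalEncoder()
speech = torch.randn(8, 100, 39)   # 8 clips, 100 frames of 39-dim acoustic features
video = torch.randn(8, 512)        # 8 pooled video-clip feature vectors
loss = contrastive_loss(model.encode_speech(speech), model.encode_video(video))
loss.backward()
```

In such a setup, evaluation on narration segments would amount to ranking candidate video clips by cosine similarity to an utterance embedding, testing whether the learned space captures visual semantics beyond the dialog it was trained on.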