Attempts to computationally simulate the acquisition of spoken language via grounding in perception have a long tradition but have gained momentum in the past few years. Current neural approaches exploit associations between the spoken and visual modalities and learn to represent speech and visual data in a joint vector space. A major unresolved issue from the point of view of ecological validity is the training data, which typically consists of images or videos paired with spoken descriptions of what is depicted. Such a setup guarantees an unrealistically strong correlation between speech and the visual world. In the real world, the coupling between the linguistic and the visual is loose, and often contains confounds in the form of correlations with non-semantic aspects of the speech signal. The current study is a first step towards simulating a naturalistic grounding scenario by using a dataset based on the children's cartoon Peppa Pig. We train a simple bi-modal architecture on the portion of the data consisting of naturalistic dialog between characters, and evaluate on segments containing descriptive narrations. Despite the weak and confounded signal in this training data, our model succeeds at learning aspects of the visual semantics of spoken language.
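To make the joint-embedding idea concrete, the following is a minimal sketch of a bi-modal contrastive model, not the paper's actual architecture: the encoder choices (a GRU over acoustic frames, a linear projection of precomputed clip features), the embedding dimensions, and the symmetric InfoNCE-style loss are all illustrative assumptions.

```python
# Minimal sketch of a bi-modal joint-embedding model (illustrative, not the
# paper's architecture): encode speech and video into a shared space and
# train with a symmetric contrastive loss over paired clips.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiModalModel(nn.Module):
    def __init__(self, speech_dim=39, video_dim=512, embed_dim=256):
        # Dimensions are hypothetical placeholders.
        super().__init__()
        # Speech branch: a GRU over acoustic frames (e.g. MFCC features).
        self.speech_rnn = nn.GRU(speech_dim, embed_dim, batch_first=True)
        # Video branch: a linear projection of precomputed clip features.
        self.video_proj = nn.Linear(video_dim, embed_dim)

    def encode_speech(self, frames):
        _, h = self.speech_rnn(frames)     # final hidden state of the GRU
        return F.normalize(h[-1], dim=-1)  # unit-length speech embedding

    def encode_video(self, feats):
        return F.normalize(self.video_proj(feats), dim=-1)

def contrastive_loss(s, v, temperature=0.1):
    # Similarity matrix between every speech/video pair in the batch;
    # matched (positive) pairs lie on the diagonal.
    logits = s @ v.t() / temperature
    targets = torch.arange(s.size(0), device=s.device)
    # Symmetric cross-entropy: retrieve video from speech and vice versa.
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

# Toy usage on a batch of 8 paired clips:
model = BiModalModel()
speech = torch.randn(8, 100, 39)   # batch x frames x acoustic features
video = torch.randn(8, 512)        # batch x precomputed clip features
loss = contrastive_loss(model.encode_speech(speech),
                        model.encode_video(video))
```

Under this formulation, speech segments and video clips that co-occur are pulled together in the shared space while mismatched pairs within the batch are pushed apart, which is how associations between the two modalities can be learned without any word-level supervision.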