Computational models of syntax are predominantly text-based. Here we propose that basic syntax can be modeled directly from raw speech in a fully unsupervised way. We focus on one of the most ubiquitous and basic properties of syntax -- concatenation. We introduce spontaneous concatenation: a phenomenon in which convolutional neural networks (CNNs) trained on acoustic recordings of individual words begin generating outputs with two or even three words concatenated, despite never accessing data with multiple words in the input. Additionally, networks trained on two words learn to embed words into novel, unobserved word combinations. To our knowledge, this is a previously unreported property of CNNs trained on raw speech in the Generative Adversarial Network setting, and it has implications both for our understanding of how these architectures learn and for modeling syntax and its evolution from raw acoustic inputs.
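The abstract does not specify the architecture beyond "CNNs trained on raw speech in the Generative Adversarial Network setting." The sketch below is a minimal, hypothetical instantiation of such a setup: a WaveGAN-style 1D-convolutional generator and discriminator trained adversarially on raw single-word waveforms. All layer sizes, the latent dimension, the audio length, and the training loop are illustrative assumptions, not the authors' exact model.

```python
import torch
import torch.nn as nn

LATENT_DIM = 100      # assumed latent size
AUDIO_LEN = 16384     # ~1 s of 16 kHz audio, assumed

class Generator(nn.Module):
    """Maps a latent vector to a raw waveform via transposed 1D convolutions."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(LATENT_DIM, 256 * 16)
        def up(ci, co):
            # each block upsamples the time axis by a factor of 4
            return nn.Sequential(
                nn.ConvTranspose1d(ci, co, kernel_size=8, stride=4, padding=2),
                nn.ReLU(),
            )
        self.net = nn.Sequential(
            up(256, 128), up(128, 64), up(64, 32), up(32, 16),
            nn.ConvTranspose1d(16, 1, kernel_size=8, stride=4, padding=2),
            nn.Tanh(),  # waveform values in [-1, 1]
        )
    def forward(self, z):
        x = self.fc(z).view(-1, 256, 16)
        return self.net(x)  # (batch, 1, AUDIO_LEN)

class Discriminator(nn.Module):
    """Scores raw waveforms with strided 1D convolutions."""
    def __init__(self):
        super().__init__()
        def down(ci, co):
            # each block downsamples the time axis by a factor of 4
            return nn.Sequential(
                nn.Conv1d(ci, co, kernel_size=8, stride=4, padding=2),
                nn.LeakyReLU(0.2),
            )
        self.net = nn.Sequential(
            down(1, 16), down(16, 32), down(32, 64), down(64, 128), down(128, 256),
        )
        self.fc = nn.Linear(256 * 16, 1)
    def forward(self, x):
        return self.fc(self.net(x).flatten(1))

def train_step(G, D, optG, optD, real, loss=nn.BCEWithLogitsLoss()):
    """One adversarial update on a batch of real single-word waveforms."""
    z = torch.randn(real.size(0), LATENT_DIM)
    fake = G(z)
    # discriminator update: push real toward 1, generated toward 0
    optD.zero_grad()
    d_loss = loss(D(real), torch.ones(real.size(0), 1)) + \
             loss(D(fake.detach()), torch.zeros(real.size(0), 1))
    d_loss.backward(); optD.step()
    # generator update: try to fool the discriminator
    optG.zero_grad()
    g_loss = loss(D(fake), torch.ones(real.size(0), 1))
    g_loss.backward(); optG.step()
    return d_loss.item(), g_loss.item()
```

In use, `real` would be batches of single-word recordings; the finding reported in the abstract is that a generator of this general kind, although it only ever sees one-word inputs, can begin producing outputs containing two or three concatenated words.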