Human speakers encode information into raw speech, which is then decoded by listeners. The two sides of this complex relationship, encoding (production) and decoding (perception), are often modeled separately. Here, we test how encoding and decoding of lexical semantic information can emerge automatically from raw speech in unsupervised generative deep convolutional networks that combine the production and perception principles of speech. We introduce, to our knowledge, the most challenging objective in unsupervised lexical learning: a network that must learn unique representations for lexical items with no direct access to the training data. We train several models (ciwGAN and fiwGAN, arXiv:2006.02951) and test how the networks classify acoustic lexical items in unobserved test data. Strong evidence of lexical learning emerges, along with a causal relationship between latent codes and meaningful sublexical units. The architecture that combines the production and perception principles is thus able to learn to decode unique information from raw acoustic data without accessing the real training data directly. We propose a technique for exploring lexical (holistic) and sublexical (featural) learned representations in the classifier network. The results have implications for unsupervised speech technology, as well as for unsupervised semantic modeling, as language models increasingly bypass text and operate on raw acoustics.
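To make the architecture described above concrete, the sketch below illustrates a ciwGAN-style setup: a generator maps a categorical latent code plus noise to a raw waveform, a discriminator distinguishes real from generated audio, and a separate Q-network ("classifier") must recover the latent code from the generated waveform alone, which is what forces lexical information into the code. This is a minimal, hedged illustration only: the layer sizes, variable names (`N_WORDS`, `NOISE_DIM`, `AUDIO_LEN`), the plain non-saturating GAN loss, and the PyTorch implementation are assumptions for readability and do not reproduce the paper's actual models (which follow the WaveGAN/InfoGAN lineage; the fiwGAN variant uses binary featural codes rather than one-hot codes).

```python
# Minimal sketch of a ciwGAN-style training step (hypothetical names and sizes).
import torch
import torch.nn as nn
import torch.nn.functional as F

N_WORDS, NOISE_DIM, AUDIO_LEN = 8, 90, 16384   # assumed: 8 lexical classes, ~1 s at 16 kHz

def conv_stack(in_ch):
    # Each stride-4 conv divides the length by 4: 16384 -> 16 after five layers.
    chs = [in_ch, 32, 64, 128, 256, 512]
    layers = []
    for a, b in zip(chs, chs[1:]):
        layers += [nn.Conv1d(a, b, 25, stride=4, padding=11), nn.LeakyReLU(0.2)]
    return nn.Sequential(*layers)

class Generator(nn.Module):
    # Maps [one-hot lexical code ; noise] to a raw waveform.
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(N_WORDS + NOISE_DIM, 512 * 16)
        chs = [512, 256, 128, 64, 32, 1]
        ups = []
        for a, b in zip(chs, chs[1:]):
            # kernel 25, stride 4, padding 11, output_padding 1 exactly quadruples the length
            ups += [nn.ConvTranspose1d(a, b, 25, stride=4, padding=11, output_padding=1), nn.ReLU()]
        ups[-1] = nn.Tanh()                      # waveform output in [-1, 1]
        self.deconv = nn.Sequential(*ups)

    def forward(self, code, noise):
        h = self.fc(torch.cat([code, noise], dim=1)).view(-1, 512, 16)
        return self.deconv(h)

class Discriminator(nn.Module):                  # real vs. generated audio
    def __init__(self):
        super().__init__()
        self.conv, self.out = conv_stack(1), nn.Linear(512 * 16, 1)
    def forward(self, x):
        return self.out(self.conv(x).flatten(1))

class QNetwork(nn.Module):                       # recovers the lexical code from audio alone
    def __init__(self):
        super().__init__()
        self.conv, self.out = conv_stack(1), nn.Linear(512 * 16, N_WORDS)
    def forward(self, x):
        return self.out(self.conv(x).flatten(1))

G, D, Q = Generator(), Discriminator(), QNetwork()
opt_g = torch.optim.Adam(list(G.parameters()) + list(Q.parameters()), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)

def train_step(real_audio):                      # real_audio: (batch, 1, AUDIO_LEN)
    b = real_audio.size(0)
    code = F.one_hot(torch.randint(N_WORDS, (b,)), N_WORDS).float()
    noise = torch.rand(b, NOISE_DIM) * 2 - 1
    fake = G(code, noise)

    # Discriminator: tell real from generated audio.
    d_loss = F.binary_cross_entropy_with_logits(D(real_audio), torch.ones(b, 1)) + \
             F.binary_cross_entropy_with_logits(D(fake.detach()), torch.zeros(b, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator + Q: fool D while making the code recoverable from the waveform,
    # so distinct codes come to correspond to distinct lexical items.
    g_loss = F.binary_cross_entropy_with_logits(D(fake), torch.ones(b, 1)) + \
             F.cross_entropy(Q(fake), code.argmax(dim=1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

After training, feeding held-out real recordings to the Q-network and inspecting which code it assigns is one way to probe the lexical (holistic) and sublexical (featural) representations the classifier has learned, in the spirit of the technique proposed in the abstract.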