Acoustic word embeddings are fixed-dimensional representations of variable-length speech segments. Such embeddings can form the basis for speech search, indexing and discovery systems when conventional speech recognition is not possible. In zero-resource settings where unlabelled speech is the only available resource, we need a method that gives robust embeddings for an arbitrary language. Here we explore multilingual transfer: we train a single supervised embedding model on labelled data from multiple well-resourced languages and then apply it to unseen zero-resource languages. We consider three multilingual recurrent neural network (RNN) models: a classifier trained on the joint vocabularies of all training languages; a Siamese RNN trained to discriminate between same and different words from multiple languages; and a correspondence autoencoder (CAE) RNN trained to reconstruct word pairs. In a word discrimination task on six target languages, all of these models outperform state-of-the-art unsupervised models trained on the zero-resource languages themselves, giving relative improvements of more than 30% in average precision. When only a few training languages are used, the multilingual CAE outperforms the other multilingual models, but with more training languages the models perform similarly. Using more training languages is generally beneficial, although the improvements are marginal on some languages. We present probing experiments showing that the CAE encodes more phonetic, word-duration, language-identity and speaker information than the other multilingual models.
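The word discrimination task used for evaluation above can be illustrated with a minimal same–different sketch: every pair of speech segments is scored by embedding similarity, and average precision measures how well same-word pairs rank above different-word pairs. This is a toy illustration with made-up labels and vectors, not the paper's actual evaluation pipeline or data.

```python
# Toy same-different word discrimination evaluation for fixed-dimensional
# acoustic word embeddings, scored by average precision (AP).
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def average_precision(segments):
    """segments: list of (word_label, embedding) pairs.

    Every pair of segments is scored by cosine similarity; a pair is a
    'match' if both segments are tokens of the same word type. AP is the
    mean of the precision values at the rank of each true match.
    """
    pairs = [(cosine(e1, e2), w1 == w2)
             for (w1, e1), (w2, e2) in combinations(segments, 2)]
    pairs.sort(key=lambda p: -p[0])  # most similar pairs first
    hits, precisions = 0, []
    for rank, (_, is_match) in enumerate(pairs, start=1):
        if is_match:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Hypothetical embeddings: same-word tokens are close, so AP is perfect.
segments = [("cat", [1.0, 0.1]), ("cat", [0.9, 0.2]),
            ("dog", [0.1, 1.0]), ("dog", [0.2, 0.9])]
print(average_precision(segments))  # 1.0 on this toy data
```

An embedding model that mixes word types together would push different-word pairs above same-word pairs in the ranking and drive AP toward zero, which is why AP is a natural summary score for this task.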