We present a method for cross-lingual training of an ASR system using absolutely no transcribed training data from the target language, and with no phonetic knowledge of the language in question. Our approach uses a novel application of a decipherment algorithm, which operates given only unpaired speech and text data from the target language. We apply this decipherment to phone sequences generated by a universal phone recogniser trained on out-of-language speech corpora, and follow it with flat-start semi-supervised training to obtain an acoustic model for the new language. To the best of our knowledge, this is the first practical approach to zero-resource cross-lingual ASR which does not rely on any hand-crafted phonetic information. We carry out experiments on read speech from the GlobalPhone corpus, and show that it is possible to learn a decipherment model on just 20 minutes of data from the target language. When the deciphered transcripts are used as pseudo-labels for semi-supervised training, we obtain WERs that range from 32.5% to just 1.9% absolute worse than those of the equivalent fully supervised models trained on the same data.
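As a concrete illustration of the decipherment idea described above, the following is a minimal toy sketch, not the system used in the paper: a bigram model over target-language phonemes is estimated from unpaired text alone, the recogniser's phone labels are treated as the output of an unknown substitution channel, and EM (Baum-Welch with the transition model held fixed) learns that channel so recognised sequences can be turned into pseudo-transcripts. The character-level phoneme inventory, the toy data, the simulated `cipher`, and the 1-best read-off are all illustrative assumptions.

```python
import numpy as np
from collections import Counter

# Unpaired target-language text, used only to estimate a bigram model over the
# target phoneme inventory (characters stand in for phonemes in this toy).
text_words = ["ta", "tam", "tama", "amam", "tamam", "ama"]
phones = sorted(set("".join(text_words)))
P = {p: i for i, p in enumerate(phones)}

bigrams = Counter((w[i], w[i + 1]) for w in text_words for i in range(len(w) - 1))
unigrams = Counter(c for w in text_words for c in w)

trans = np.full((len(phones), len(phones)), 1e-3)   # smoothed bigram transitions
for (a, b), n in bigrams.items():
    trans[P[a], P[b]] += n
trans /= trans.sum(axis=1, keepdims=True)
pi = np.array([unigrams[p] for p in phones], dtype=float)
pi /= pi.sum()

# "Recognised" phone sequences for the unlabelled speech: the mismatch between
# the universal recogniser's labels and the target inventory is simulated with
# an unknown substitution (the decipherment never sees `cipher` or `spoken`).
cipher = {"t": "K", "a": "X", "m": "B"}
spoken = ["tama", "ama", "tam", "amam", "ta"]
observed = ["".join(cipher[c] for c in w) for w in spoken]
obs_syms = sorted(set("".join(observed)))
O = {s: i for i, s in enumerate(obs_syms)}

# EM (Baum-Welch with transitions fixed) over the channel P(observed | phoneme).
rng = np.random.default_rng(0)
emit = rng.dirichlet(np.ones(len(obs_syms)), size=len(phones))
for _ in range(100):
    counts = np.zeros_like(emit)
    for w in observed:
        o = [O[s] for s in w]
        T = len(o)
        alpha = np.zeros((T, len(phones)))
        beta = np.ones((T, len(phones)))
        alpha[0] = pi * emit[:, o[0]]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ trans) * emit[:, o[t]]
        for t in range(T - 2, -1, -1):
            beta[t] = trans @ (emit[:, o[t + 1]] * beta[t + 1])
        gamma = alpha * beta
        gamma /= gamma.sum(axis=1, keepdims=True)    # per-position state posteriors
        for t in range(T):
            counts[:, o[t]] += gamma[t]
    emit = (counts + 1e-6) / (counts + 1e-6).sum(axis=1, keepdims=True)

# Read off a 1-best symbol mapping and turn recognised sequences into
# pseudo-transcripts.
mapping = {s: phones[int(emit[:, O[s]].argmax())] for s in obs_syms}
pseudo_labels = ["".join(mapping[s] for s in w) for w in observed]
print(mapping)        # learned mapping; compare with the simulated `cipher`
print(pseudo_labels)  # deciphered pseudo-transcripts for the unlabelled audio
```

In a full pipeline, the analogue of `pseudo_labels` would be the pseudo-transcripts that seed flat-start semi-supervised training of the target-language acoustic model.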