Automatic speech recognition (ASR) has recently achieved strong performance. However, most ASR systems rely on massive amounts of paired data, which is not feasible to obtain for low-resource languages worldwide. This paper investigates how to learn directly from unpaired phone sequences and speech utterances. We design a two-stage iterative framework. In the first stage, GAN training is adopted to find the mapping between unpaired speech and phone sequences. In the second stage, a hidden Markov model (HMM) is trained on the generator's output, which boosts performance and provides a better segmentation for the next iteration. In the experiments, we first investigate different model design choices. We then compare the framework with three types of baselines: (i) supervised methods, (ii) acoustic unit discovery based methods, and (iii) methods learning from unpaired data. On the TIMIT dataset, our framework performs consistently better than all acoustic unit discovery methods and previous methods learning from unpaired data.
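The control flow of the two-stage iteration can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the function `iterate_framework`, the callables `train_gan` and `train_hmm`, the boundary-list representation of a segmentation, and the fixed `num_iterations` are all hypothetical names and assumptions introduced here. The abstract does not specify how the initial segmentation is obtained or how many iterations are run.

```python
from typing import Callable, List, Sequence, Tuple

# Assumed representation: one list of boundary frame indices per utterance.
Segmentation = List[List[int]]


def iterate_framework(
    utterances: Sequence[Sequence[float]],
    phone_sequences: Sequence[Sequence[str]],
    initial_segmentation: Segmentation,
    train_gan: Callable,   # hypothetical: returns generator(utterance, boundaries) -> phones
    train_hmm: Callable,   # hypothetical: returns realign(utterance) -> boundaries
    num_iterations: int = 3,  # assumption: the abstract gives no iteration count
) -> Tuple[Segmentation, List[Segmentation]]:
    """Alternate GAN training (stage 1) and HMM training (stage 2)."""
    segmentation = initial_segmentation
    history: List[Segmentation] = []
    for _ in range(num_iterations):
        # Stage 1: learn a mapping from segmented, unpaired speech to phones.
        generator = train_gan(utterances, phone_sequences, segmentation)
        pseudo_labels = [
            generator(utt, bounds)
            for utt, bounds in zip(utterances, segmentation)
        ]
        # Stage 2: train an HMM on the generator's output, then use it to
        # produce the segmentation for the next iteration.
        realign = train_hmm(utterances, pseudo_labels)
        segmentation = [realign(utt) for utt in utterances]
        history.append(segmentation)
    return segmentation, history
```

Passing the two training routines in as callables keeps the sketch independent of any particular GAN or HMM toolkit; only the alternation between the stages and the hand-off of the segmentation is shown.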