Recent advances in unsupervised representation learning have demonstrated the impact of pretraining on large amounts of read speech. We adapt these techniques for domain adaptation to conversational and broadcast domains that are low-resource in both data and compute. Moving beyond CTC, we pretrain state-of-the-art Conformer models in an unsupervised manner. While the unsupervised approach outperforms traditional semi-supervised training, the two techniques are complementary: combining them yields a 5% absolute improvement in WER, averaged over all conditions, compared to semi-supervised training alone. Additional text data is incorporated through external language models. By using CTC-based decoding, we are better able to take advantage of this additional text data. When the CTC-based system is used as the transcription model for semi-supervised training, the Conformer model incorporates the knowledge from the language model more effectively than with shallow fusion. Final performance is an additional 2% absolute better when CTC-based decoding is used for semi-supervised training rather than shallow fusion.
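The contrast between shallow fusion and CTC-plus-LM transcription can be made concrete with a minimal sketch. The snippet below is illustrative only and not taken from the paper: the Hypothesis fields and the lm_weight value are assumptions. It shows the shallow-fusion scoring rule, which interpolates the end-to-end model's log-probability with an external LM's log-probability when ranking beam-search hypotheses; in the CTC-based alternative described above, the LM-rescored output is instead used as a pseudo-transcript for semi-supervised training.

```python
# Illustrative sketch of shallow fusion (assumed names, not the paper's code).
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class Hypothesis:
    tokens: list[int]   # decoded token ids so far
    am_logprob: float   # cumulative end-to-end (e.g. Conformer) model log-probability
    lm_logprob: float   # cumulative external language model log-probability

def shallow_fusion_score(hyp: Hypothesis, lm_weight: float = 0.3) -> float:
    """Rank beam-search hypotheses by log p_AM(y|x) + lm_weight * log p_LM(y)."""
    return hyp.am_logprob + lm_weight * hyp.lm_logprob

# Example: the LM term shifts the ranking toward the more fluent hypothesis.
hyps = [
    Hypothesis(tokens=[12, 7], am_logprob=-4.1, lm_logprob=-3.0),  # best by AM score alone
    Hypothesis(tokens=[12, 9], am_logprob=-4.3, lm_logprob=-1.5),  # best after fusion
]
best = max(hyps, key=shallow_fusion_score)
print(best.tokens)
```

Under shallow fusion the LM only reweights hypotheses at decode time, whereas using the LM-informed CTC output as training transcripts lets its knowledge be absorbed into the Conformer model's own parameters through semi-supervised training.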