Automatic speech recognition (ASR) has been established as a well-performing technique for many scenarios where large amounts of labeled data are available. Additionally, unsupervised representation learning has recently helped to tackle tasks with limited data. Following this, hardware limitations and applications give rise to the question of how to efficiently take advantage of large pre-trained models and reduce their complexity for downstream tasks. In this work, we study a challenging low-resource conversational telephony speech corpus from the medical domain in Vietnamese and German. We show the benefits of using unsupervised techniques beyond simple fine-tuning of large pre-trained models, discuss how to adapt them to a practical telephony task including bandwidth transfer, and investigate different data conditions for pre-training and fine-tuning. We outperform the project baselines by 22% relative using pre-training techniques. Further gains of 29% can be achieved by refinements of architecture and training, and of 6% by adding 0.8 h of in-domain adaptation data.