Building an automatic speech recognition (ASR) system from scratch requires a large amount of annotated speech data, which is difficult to collect in many languages. However, there are cases where the low-resource language shares a common acoustic space with a high-resource language that has enough annotated data to build an ASR. In such cases, we show that domain-independent acoustic models learned from the high-resource language through unsupervised domain adaptation (UDA) schemes can enhance the performance of the ASR in the low-resource language. We use the specific example of Hindi in the source domain and Sanskrit in the target domain. We explore two architectures: i) domain adversarial training using a gradient reversal layer (GRL) and ii) domain separation networks (DSN). The GRL and DSN architectures give absolute improvements of 6.71% and 7.32%, respectively, in word error rate over the baseline deep neural network model when trained on just 5.5 hours of data in the target domain. We also show that choosing a suitable source-domain language (Telugu) can bring further improvement. The results suggest that UDA schemes can be helpful in the development of ASR systems for low-resource languages, mitigating the hassle of collecting large amounts of annotated speech data.
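The gradient reversal layer at the heart of the GRL architecture acts as the identity in the forward pass and negates (and scales) gradients in the backward pass, so that minimizing the domain classifier's loss pushes the shared features toward domain invariance. A minimal PyTorch sketch of such a layer (an illustration of the standard technique, not the authors' implementation; the `lambd` scaling factor is an assumption):

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; reverses and scales the
    gradient in the backward pass, so the feature extractor
    learns to confuse the downstream domain classifier."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Negate the gradient flowing back into the feature extractor;
        # the second return value is the (non-)gradient w.r.t. lambd.
        return grad_output.neg() * ctx.lambd, None

def grad_reverse(x, lambd=1.0):
    """Apply the gradient reversal layer to a tensor."""
    return GradReverse.apply(x, lambd)
```

In domain-adversarial training, this layer sits between the shared acoustic feature extractor and the domain classifier, while the senone/phone classifier sees the features directly, yielding representations that are discriminative for speech content but uninformative about the domain.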