In this work, we develop new self-learning techniques with an attention-based sequence-to-sequence (seq2seq) model for automatic speech recognition (ASR). For untranscribed speech data, the hypothesis from an ASR system must be used as a label. However, imperfect ASR results make it difficult for unsupervised learning to consistently improve recognition performance, especially when multiple powerful teacher models are unavailable. In contrast to conventional unsupervised learning approaches, we adopt the \emph{multi-task learning} (MTL) framework, where the $n$-th best ASR hypothesis is used as the label for the $n$-th task. The seq2seq network is updated through the MTL framework so as to find a common representation that can cover multiple hypotheses. By doing so, the effect of \emph{hard-decision} errors can be alleviated. We first demonstrate the effectiveness of our self-learning methods through ASR experiments on an accent adaptation task between US and British English speech. Our experimental results show that our method reduces the word error rate (WER) on the British English speech data from 14.55\% to 10.36\%, relative to a baseline model trained on US English data only. Moreover, we investigate the effect of our proposed methods in a federated learning scenario.
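To make the objective concrete, the following is a minimal sketch in Python of one way such an MTL self-learning loss over the $N$-best hypotheses could be formed. The model interface (a seq2seq network returning per-token log-probabilities for a given target sequence), the tensor shapes, and the uniform task weighting are our own illustrative assumptions, not the implementation used in this work.

\begin{verbatim}
# Illustrative sketch only: the model interface and the uniform
# task weights are assumptions, not the paper's implementation.
import torch

def mtl_self_learning_loss(model, feats, nbest_hyps, weights=None):
    """Combine seq2seq losses over the N-best ASR hypotheses of an
    untranscribed utterance, treating each hypothesis as one task."""
    n = len(nbest_hyps)
    # Equal task weights unless caller supplies, e.g., posterior-based ones.
    weights = weights if weights is not None else [1.0 / n] * n
    total = 0.0
    for w, hyp in zip(weights, nbest_hyps):
        # The n-th best hypothesis serves as the label for the n-th task;
        # gradients flow through the shared encoder/decoder parameters,
        # pulling them toward a representation covering all hypotheses.
        log_probs = model(feats, targets=hyp)   # assumed shape: (T, |vocab|)
        nll = -log_probs.gather(1, hyp.unsqueeze(1)).sum()
        total = total + w * nll
    return total
\end{verbatim}

Under this view, using only the 1-best hypothesis reduces to conventional self-training with a hard label; summing over the $N$-best list softens that hard decision, which is the alleviation effect described above.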