This paper proposes a novel label-synchronous speech-to-text alignment technique for automatic speech recognition (ASR). Speech-to-text alignment is the problem of splitting long audio recordings with unaligned transcripts into utterance-wise pairs of speech and text. Unlike conventional methods based on frame-synchronous prediction, the proposed method redefines speech-to-text alignment as a label-synchronous text mapping problem. This enables accurate alignment that benefits from the strong inference ability of state-of-the-art attention-based encoder-decoder models, which cannot be applied in the conventional methods. Two different Transformer models, named the forward Transformer and the backward Transformer, are respectively used for estimating the initial and final tokens of a given speech segment based on end-of-sentence prediction with teacher-forcing. Experiments using the Corpus of Spontaneous Japanese (CSJ) demonstrate that the proposed method provides an accurate utterance-wise alignment that matches the manually annotated alignment with as few as 0.2% errors. It is also confirmed that a Transformer-based hybrid CTC/Attention ASR model using the aligned speech and text pairs as additional training data reduces character error rates by up to 59.0% relative, which is significantly better than the 39.0% reduction obtained with a conventional alignment method based on a connectionist temporal classification (CTC) model.
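The idea of locating a segment boundary by end-of-sentence prediction with teacher-forcing can be illustrated with a minimal sketch. It is not the procedure of the paper: the function `estimate_boundary`, the callable `eos_logprob`, and the `max_len` cap are hypothetical names introduced here, and the toy scorer merely stands in for a Transformer decoder conditioned on the speech segment. The sketch assumes that the candidate transcript tokens are fed to the decoder one prefix at a time and that the boundary is placed where the end-of-sentence score is highest.

```python
import math
from typing import Callable, Sequence


def estimate_boundary(
    tokens: Sequence[str],
    start: int,
    eos_logprob: Callable[[Sequence[str]], float],
    max_len: int,
) -> int:
    """Return the exclusive end index of the segment that begins at `start`.

    `eos_logprob(prefix)` is a hypothetical stand-in for a decoder that is
    teacher-forced with `prefix` and returns the log-probability of emitting
    end-of-sentence next. The prefix with the highest score wins.
    """
    best_end, best_score = start + 1, -math.inf
    limit = min(len(tokens), start + max_len)
    for end in range(start + 1, limit + 1):
        score = eos_logprob(tokens[start:end])
        if score > best_score:
            best_end, best_score = end, score
    return best_end


def toy_eos_logprob(prefix: Sequence[str]) -> float:
    # Toy scorer (assumption): pretends the speech segment covers 3 tokens.
    return -abs(len(prefix) - 3)


transcript = list("abcdefg")
end = estimate_boundary(transcript, start=2, eos_logprob=toy_eos_logprob, max_len=10)
segment = transcript[2:end]
```

With the toy scorer, the selected segment covers the three tokens beginning at index 2. In a real system the scorer would be a trained attention-based decoder, and a second model scanning in the opposite direction would supply the other boundary.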