Sequence discriminative training is a powerful tool to improve the performance of an automatic speech recognition system. It does, however, necessitate a sum over all possible word sequences, which is intractable to compute in practice. Current state-of-the-art systems with unlimited label context circumvent this problem by limiting the summation to an n-best list of relevant competing hypotheses obtained from beam search. This work proposes to perform (approximative) recombination of hypotheses during beam search whenever they share a common local history. The error incurred by this approximation is analyzed, and it is shown that using this technique the effective beam size can be increased by several orders of magnitude without significantly increasing the computational requirements. Lastly, it is shown that this technique can be used to effectively perform sequence discriminative training for attention-based encoder-decoder acoustic models on the LibriSpeech task.
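To make the core idea concrete, the following Python sketch shows one way such a recombination step could be implemented inside a plain beam search. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Hypothesis` structure, the generic `expand_fn` scorer, and the `local_context` parameter are all hypothetical names introduced here. Hypotheses whose last `local_context` labels agree are merged by accumulating their probability mass (log-sum-exp of scores) while keeping only the best-scoring label prefix; discarding the other prefixes is the approximation the abstract refers to.

```python
import math
from dataclasses import dataclass

@dataclass
class Hypothesis:
    labels: tuple          # label sequence decoded so far
    log_score: float       # log-probability of this hypothesis under the model

def logsumexp(a: float, b: float) -> float:
    """Numerically stable log(exp(a) + exp(b))."""
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def recombine(hyps, local_context: int):
    """Merge hypotheses that share the same last `local_context` labels.

    The merged hypothesis keeps the labels of the best-scoring member
    (this is the approximation) but accumulates the probability mass of
    all members, so the beam covers far more of the summation space.
    """
    merged = {}
    for hyp in hyps:
        key = hyp.labels[-local_context:]
        if key not in merged:
            merged[key] = hyp
        else:
            best = merged[key]
            total = logsumexp(best.log_score, hyp.log_score)
            keep = best if best.log_score >= hyp.log_score else hyp
            merged[key] = Hypothesis(keep.labels, total)
    return list(merged.values())

def beam_search(expand_fn, init: Hypothesis, steps: int,
                beam_size: int, local_context: int):
    """Plain beam search with recombination applied after each expansion."""
    beam = [init]
    for _ in range(steps):
        expanded = [h2 for h in beam for h2 in expand_fn(h)]
        expanded = recombine(expanded, local_context)
        expanded.sort(key=lambda h: h.log_score, reverse=True)
        beam = expanded[:beam_size]
    return beam

# Toy usage with a fixed two-label expansion (purely illustrative):
def expand_fn(h):
    return [Hypothesis(h.labels + (lbl,), h.log_score + lp)
            for lbl, lp in ((0, math.log(0.6)), (1, math.log(0.4)))]

final_beam = beam_search(expand_fn, Hypothesis((), 0.0), steps=5,
                         beam_size=4, local_context=2)
```

In a sketch like this, the summed scores of the recombined hypotheses stand in for the probability mass of all merged paths, which is what lets a modest beam approximate the intractable sum over word sequences used in sequence discriminative training.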