Modern end-to-end speech recognition models show astonishing results in transcribing audio signals into written text. However, conventional data feeding pipelines may be sub-optimal for low-resource speech recognition, which remains a challenging task. We propose an automated curriculum learning approach that optimizes the sequence of training examples based on both the model's progress during training and prior knowledge about the difficulty of the training examples. We introduce a new difficulty measure, the compression ratio, that can be used as a scoring function for raw audio in various noise conditions. The proposed method reduces word error rate (WER) by up to 33% relative to the baseline system.
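To make the idea of a compression-based difficulty score concrete, the following is a minimal sketch, not the authors' implementation. It assumes the score is the compressed size divided by the raw size of the 16-bit PCM audio under a general-purpose lossless compressor (here `zlib`), so that noisier and less predictable audio scores higher. The abstract does not state the exact definition or compressor; the helper names `compression_ratio` and `synth_utterance` and the synthetic tone-plus-noise signals are hypothetical and serve only as illustration.

```python
import math
import random
import struct
import zlib


def compression_ratio(pcm_bytes: bytes) -> float:
    """Compressed size / raw size. Assumed definition: higher means harder."""
    if not pcm_bytes:
        raise ValueError("empty audio")
    return len(zlib.compress(pcm_bytes, 9)) / len(pcm_bytes)


def synth_utterance(noise_std: float, n: int = 16000, seed: int = 0) -> bytes:
    """Hypothetical stand-in for a recording: 440 Hz tone plus Gaussian noise,
    encoded as little-endian 16-bit PCM at an assumed 16 kHz sample rate."""
    rng = random.Random(seed)
    samples = []
    for i in range(n):
        x = 0.5 * math.sin(2 * math.pi * 440 * i / 16000) + rng.gauss(0.0, noise_std)
        x = max(-1.0, min(1.0, x))  # clip to the valid amplitude range
        samples.append(int(x * 32767))
    return struct.pack(f"<{n}h", *samples)


# Score two synthetic "utterances" under different noise conditions.
scores = {
    "clean": compression_ratio(synth_utterance(noise_std=0.0)),
    "noisy": compression_ratio(synth_utterance(noise_std=0.2)),
}

# A static easy-to-hard ordering: lowest score (most compressible) first.
curriculum = sorted(scores, key=scores.get)
```

This sketch covers only the prior-knowledge half of the approach. The automated curriculum described in the abstract also adapts the ordering to the model's progress during training, which is not modeled here.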