Automatic Speech Recognition (ASR) systems frequently use a search-based decoding strategy that aims to find the best attainable transcript by considering multiple candidates. One prominent speech recognition decoding heuristic is beam search, which seeks the transcript with the greatest likelihood computed using the predicted distribution. While showing substantial performance gains in various tasks, beam search loses some of its effectiveness when the predicted probabilities are highly confident, i.e., when the probability mass is concentrated on a single class or very few classes. We show that recently proposed Self-Supervised Learning (SSL)-based ASR models tend to yield exceptionally confident predictions that can prevent beam search from truly considering a diverse set of candidates. We perform a layer analysis to reveal and visualize how predictions evolve, and propose a decoding procedure that improves the performance of fine-tuned ASR models. Our proposed approach requires neither further training beyond the original fine-tuning nor additional model parameters. In fact, we find that our proposed method requires significantly less inference computation than current approaches. We propose aggregating the top M layers, potentially leveraging useful information encoded in intermediate layers, and relaxing model confidence. We demonstrate the effectiveness of our approach through an empirical study on varying amounts of labeled resources and different model sizes, showing consistent improvements, particularly in low-resource scenarios.
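The abstract does not specify the exact aggregation rule, but the two ideas it names, combining the top M layers and relaxing model confidence, can be sketched as follows. This is a minimal illustration, assuming per-layer frame-level logits are available and that confidence is relaxed via temperature scaling; the function names and the averaging-of-distributions choice are assumptions for illustration, not the paper's stated method.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; temperature > 1 flattens (relaxes)
    an overly confident distribution."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def aggregate_top_m(layer_logits, m=3, temperature=2.0):
    """Average the temperature-relaxed distributions of the last m layers.

    layer_logits: list of (T, V) arrays, one per model layer,
                  ordered from first to last layer (T frames, V classes).
    Returns a (T, V) frame-level distribution to feed into beam search.
    """
    top = layer_logits[-m:]                       # take the top M layers
    probs = [softmax(l, temperature) for l in top]
    return np.mean(probs, axis=0)                 # uniform layer average
```

The resulting smoothed, layer-averaged distribution is then handed to a standard beam-search decoder in place of the final layer's (typically near-one-hot) output.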