We investigate robustness properties of pre-trained neural models for automatic speech recognition. Real-world data in machine learning is rarely clean; it is typically corrupted by factors that vary with the domain, such as outliers, random noise, and adversarial noise. The models we develop should therefore be robust to such noisy data, a need that has driven the thriving field of robust machine learning. We consider this important issue in the setting of automatic speech recognition. With the increasing popularity of pre-trained models, analyzing and understanding their robustness to noise is an important question. In this work, we perform a robustness analysis of the pre-trained neural models wav2vec2, HuBERT, and DistilHuBERT on the LibriSpeech and TIMIT datasets. We apply several kinds of noising mechanisms and measure model performance in terms of inference time and the standard Word Error Rate (WER) metric. We also conduct an in-depth layer-wise analysis of the wav2vec2 model by injecting noise between layers, which enables us to characterize at a high level what each layer learns. Finally, for this model, we visualize how errors propagate across the layers and compare its behavior on clean versus noisy data. Our experiments confirm the predictions of Pasad et al. [2021] and raise interesting directions for future work.
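To make the evaluation setup concrete, below is a minimal sketch (not the paper's exact pipeline) of the kind of noise-robustness probe described above: transcribing a clean and a Gaussian-noised waveform with a pre-trained wav2vec2 checkpoint and comparing Word Error Rates. The checkpoint name, noise level, and the jiwer dependency are illustrative assumptions.

```python
# Sketch of a noise-robustness probe for a pre-trained ASR model.
# Assumptions: HuggingFace transformers, the facebook/wav2vec2-base-960h
# checkpoint, and jiwer for WER; none of these are confirmed by the paper.
import numpy as np
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from jiwer import wer

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

def transcribe(waveform: np.ndarray, sample_rate: int = 16000) -> str:
    """Greedy CTC decoding of a single 16 kHz waveform."""
    inputs = processor(waveform, sampling_rate=sample_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(ids)[0]

def add_gaussian_noise(waveform: np.ndarray, snr_db: float) -> np.ndarray:
    """Inject additive Gaussian noise at a target signal-to-noise ratio (dB)."""
    signal_power = np.mean(waveform ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = np.random.normal(0.0, np.sqrt(noise_power), size=waveform.shape)
    return waveform + noise

# Usage with a hypothetical 16 kHz utterance `audio` and its reference text:
# clean_wer = wer(reference, transcribe(audio))
# noisy_wer = wer(reference, transcribe(add_gaussian_noise(audio, snr_db=10)))
```

Comparing the clean and noisy WER at several SNR levels gives the kind of degradation curve the experiments quantify; the layer-wise analysis instead injects noise into intermediate transformer representations rather than the raw waveform.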