Recent work on self-supervised pre-training focuses on leveraging large-scale unlabeled speech data to build robust end-to-end (E2E) acoustic models (AMs) that can later be fine-tuned on downstream tasks, e.g., automatic speech recognition (ASR). Yet, few works have investigated the impact on performance when the data properties substantially differ between the pre-training and fine-tuning phases, a mismatch termed domain shift. We target this scenario by analyzing the robustness of Wav2Vec 2.0 and XLS-R models on downstream ASR for a completely unseen domain: air traffic control (ATC) communications. We benchmark these two models on several open-source and challenging ATC databases with signal-to-noise ratios between 5 and 20 dB. By fine-tuning the E2E acoustic models with only a small fraction of labeled data, we obtain relative word error rate (WER) reductions of 20% to 40% compared to hybrid-based ASR baselines. We further analyze WERs in a low-resource scenario and the gender bias carried by one ATC dataset.
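To make the fine-tuning setup concrete, the following is a minimal sketch (not the authors' code) of continuing to train a pre-trained Wav2Vec 2.0 model with a CTC head on target-domain labeled data, using the Hugging Face transformers library; the checkpoint name, example transcript, and hyperparameter choices are illustrative assumptions.

```python
# Minimal sketch: fine-tuning a pre-trained Wav2Vec 2.0 model for ASR
# with CTC, assuming the Hugging Face `transformers` API. The checkpoint,
# placeholder waveform, and transcript below are illustrative only.
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

checkpoint = "facebook/wav2vec2-large-960h"  # assumed example checkpoint
processor = Wav2Vec2Processor.from_pretrained(checkpoint)
model = Wav2Vec2ForCTC.from_pretrained(checkpoint)

# Freeze the convolutional feature encoder, a common choice when only a
# small fraction of labeled target-domain data is available.
model.freeze_feature_encoder()

# One training step on a (waveform, transcript) pair from the target
# domain (e.g., an ATC utterance sampled at 16 kHz).
waveform = torch.randn(16000)  # placeholder 1-second utterance
inputs = processor(waveform.numpy(), sampling_rate=16000,
                   return_tensors="pt")
labels = processor.tokenizer("CLEARED FOR TAKEOFF",
                             return_tensors="pt").input_ids

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
outputs = model(input_values=inputs.input_values, labels=labels)
outputs.loss.backward()  # CTC loss on the target-domain transcript
optimizer.step()
```

In practice, fine-tuning would iterate this step over the labeled ATC training set, typically with a learning-rate warm-up and early stopping on a held-out set; the point of the sketch is that only the fine-tuning stage sees in-domain labels, while the encoder weights come from large-scale out-of-domain pre-training.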