Speech quality in online conferencing applications is typically assessed through human judgements in the form of the mean opinion score (MOS) metric. Since such a labor-intensive approach is not feasible for large-scale speech quality assessments in most settings, the focus has shifted towards automated MOS prediction through end-to-end training of deep neural networks (DNN). Instead of training a network from scratch, we propose to leverage the speech representations from the pre-trained wav2vec-based XLS-R model. However, the number of parameters of such a model exceeds that of task-specific DNNs by several orders of magnitude, which poses a challenge for fine-tuning on smaller datasets. Therefore, we opt to use the pre-trained speech representations from XLS-R in a feature extraction rather than a fine-tuning setting, thereby significantly reducing the number of trainable model parameters. We compare our proposed XLS-R-based feature extractor to a Mel-frequency cepstral coefficient (MFCC)-based one, and experiment with various combinations of bidirectional long short-term memory (Bi-LSTM) and attention pooling feedforward (AttPoolFF) networks trained on the output of the feature extractors. We demonstrate the increased performance of pre-trained XLS-R embeddings in terms of a reduced root mean squared error (RMSE) on the ConferencingSpeech 2022 MOS prediction task.
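The attention pooling step named above (AttPoolFF) maps a variable-length sequence of frame-level embeddings (whether frozen XLS-R features or MFCCs) to a single utterance-level vector for MOS regression. As a minimal, hedged sketch of that mechanism only (the function and variable names here, e.g. `attention_pool` and `w`, are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_pool(frames, w):
    """Pool a (T, D) sequence of frame embeddings into one (D,) vector.

    frames: frame-level features (e.g. frozen XLS-R embeddings or MFCCs)
    w: (D,) attention query vector; in a trained AttPoolFF head this would
       be learned jointly with the MOS regression layers (hypothetical here)
    """
    scores = frames @ w        # (T,) unnormalized per-frame attention scores
    alpha = softmax(scores)    # attention weights over time, summing to 1
    return alpha @ frames      # attention-weighted average over frames

rng = np.random.default_rng(0)
frames = rng.normal(size=(50, 8))  # toy sequence: 50 frames, 8-dim embeddings
w = rng.normal(size=8)
pooled = attention_pool(frames, w)
print(pooled.shape)  # one fixed-size vector regardless of sequence length
```

The fixed-size pooled vector can then be passed to a small feedforward regressor that outputs the predicted MOS; because the XLS-R extractor is frozen, only the pooling and regression parameters would be trained.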