Self-supervised models have recently emerged as popular foundation blocks in speech processing pipelines. These models are pre-trained on unlabeled audio data and then used in downstream speech processing tasks such as automatic speech recognition (ASR) or speech translation (ST). Since these models are now used in research and industrial systems alike, it becomes necessary to understand the impact of features such as the gender distribution within the pre-training data. Using French as our language of investigation, we train and compare gender-specific wav2vec 2.0 models against models with varying degrees of gender balance in their pre-training data. The comparison is performed by applying these models to two speech-to-text downstream tasks: ASR and ST. Results show that the type of downstream integration matters. We observe lower overall performance when gender-specific pre-training is used before fine-tuning an end-to-end ASR system. However, when the self-supervised models are used as feature extractors, the overall ASR and ST results follow more complex patterns in which the balanced pre-trained model does not necessarily lead to the best results. Lastly, our crude 'fairness' metric, the relative performance difference measured between female and male test sets, does not vary strongly from balanced to gender-specific pre-trained wav2vec 2.0 models.
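The fairness metric mentioned above can be sketched as a simple relative difference between per-group scores. The function below is a minimal, hypothetical formalization (the exact definition and the example numbers are illustrative assumptions, not values from the paper):

```python
def relative_gap(score_female: float, score_male: float) -> float:
    """Relative performance difference between female and male test sets.

    For a lower-is-better metric such as WER, a positive value means
    worse performance on the female test set, a negative value the reverse.
    NOTE: this is an illustrative formalization, not the paper's exact one.
    """
    return (score_female - score_male) / score_male

# Illustrative WER values: 18.2 on the female test set, 17.5 on the male one.
gap = relative_gap(18.2, 17.5)
print(f"{gap:.1%}")  # → 4.0%
```

A signed relative gap (rather than an absolute one) preserves the direction of the disparity, which matters when comparing balanced against gender-specific pre-trained models.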