Self-Supervised Learning (SSL) models have been successfully applied to various deep learning-based speech tasks, particularly those with limited amounts of data. However, the quality of SSL representations depends heavily on the relatedness between the SSL training domain(s) and the target data domain. In contrast, spectral feature (SF) extractors such as log Mel-filterbanks are hand-crafted, non-learnable components and may therefore be more robust to domain shifts. The present work examines the assumption that combining non-learnable SF extractors with SSL models is an effective approach to low-resource speech tasks. We propose a learnable and interpretable framework for combining SF and SSL representations. The proposed framework significantly outperforms both baseline and SSL models on Automatic Speech Recognition (ASR) and Speech Translation (ST) tasks across three low-resource datasets. We additionally design a mixture-of-experts-based combination model. This latter model reveals that the relative contribution of SSL models over conventional SF extractors is very small when there is a domain mismatch between the SSL training set and the target language data.
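A minimal sketch of one way such a learnable, interpretable combination could look, assuming a per-frame softmax gate over the two feature streams in the spirit of a mixture of experts; the class and parameter names (`FeatureCombiner`, `sf_proj`, `ssl_proj`, `gate`) are hypothetical, and the paper's actual architecture is not specified here:

```python
import torch
import torch.nn as nn

class FeatureCombiner(nn.Module):
    """Hypothetical sketch: combine spectral features (SF) and SSL
    representations with a learnable, input-dependent gate. The gate
    weights expose the relative contribution of each extractor, which
    is what would make the combination interpretable."""

    def __init__(self, sf_dim: int, ssl_dim: int, out_dim: int):
        super().__init__()
        # Project both feature streams to a shared dimension.
        self.sf_proj = nn.Linear(sf_dim, out_dim)
        self.ssl_proj = nn.Linear(ssl_dim, out_dim)
        # Gating network: per-frame mixing weights over the two
        # "experts" (SF vs. SSL).
        self.gate = nn.Linear(2 * out_dim, 2)

    def forward(self, sf: torch.Tensor, ssl: torch.Tensor) -> torch.Tensor:
        # sf:  (batch, time, sf_dim)   e.g. log Mel-filterbank frames
        # ssl: (batch, time, ssl_dim)  e.g. SSL model outputs
        # (assumes both streams are already time-aligned)
        h_sf, h_ssl = self.sf_proj(sf), self.ssl_proj(ssl)
        weights = torch.softmax(
            self.gate(torch.cat([h_sf, h_ssl], dim=-1)), dim=-1)
        # Inspecting weights[..., 0] vs. weights[..., 1] reveals how much
        # each extractor contributes, e.g. under domain mismatch.
        return weights[..., 0:1] * h_sf + weights[..., 1:2] * h_ssl

# Example: fuse 80-dim filterbanks with 768-dim SSL features.
combiner = FeatureCombiner(sf_dim=80, ssl_dim=768, out_dim=256)
fused = combiner(torch.randn(4, 100, 80), torch.randn(4, 100, 768))
print(fused.shape)  # torch.Size([4, 100, 256])
```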