Predicting altered acoustic frames is an effective self-supervised learning approach for speech representation. However, it is challenging to prevent the pretrained model from overfitting. In this paper, we propose introducing two dropout regularization methods into the pretraining of the Transformer encoder: (1) attention dropout and (2) layer dropout. Both dropout methods encourage the model to utilize global speech information rather than simply copying local spectral features when reconstructing the masked frames. We evaluate the proposed methods on phoneme classification and speaker recognition tasks. The experiments demonstrate that our dropout approaches achieve competitive results and improve classification accuracy on downstream tasks.
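To make the two regularizers concrete, the following is a minimal PyTorch sketch of a single Transformer encoder layer that applies both attention dropout (dropout on the attention probabilities) and layer dropout (randomly skipping the whole layer during training). The class and hyperparameter names (EncoderLayerWithDropout, p_attn, p_layer) and the default values are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class EncoderLayerWithDropout(nn.Module):
    """Sketch of one encoder layer with attention dropout and layer dropout."""

    def __init__(self, d_model=768, n_heads=12, p_attn=0.1, p_layer=0.1):
        super().__init__()
        # nn.MultiheadAttention's `dropout` argument is applied to the
        # attention weights, i.e. attention dropout.
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=p_attn,
                                          batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.p_layer = p_layer  # probability of dropping this layer entirely

    def forward(self, x):
        # Layer dropout (stochastic depth): during pretraining, skip the
        # whole layer with probability p_layer, so no single layer can be
        # relied on to copy local spectral detail into the reconstruction.
        if self.training and torch.rand(1).item() < self.p_layer:
            return x
        attn_out, _ = self.attn(x, x, x)   # attention dropout applied inside
        x = self.norm1(x + attn_out)
        x = self.norm2(x + self.ffn(x))
        return x
```

In this sketch, both regularizers are active only in training mode; at inference the layer behaves as a standard Transformer encoder layer.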