Self-supervised learning (SSL) is a powerful technique for learning representations from unlabeled data. Transformer-based models such as HuBERT, which consist of a feature extractor followed by transformer layers, lead the field in the speech domain. SSL models are fine-tuned on a wide range of downstream tasks, which involves re-training the majority of the model for each task. Previous studies have proposed applying adapters, small lightweight modules commonly used in Natural Language Processing (NLP), to adapt pre-trained models to new tasks. However, such efficient tuning techniques only provide adaptation at the transformer layers and fail to adapt the feature extractor. In this paper, we propose CHAPTER, an efficient tuning method specifically designed for SSL speech models that applies CNN adapters at the feature extractor. With this method, we fine-tune fewer than 5% of the parameters per task compared to full fine-tuning, while achieving better and more stable performance. We empirically find that adding CNN adapters to the feature extractor helps adaptation on emotion and speaker tasks. For instance, the accuracy of SID improves from 87.71 to 91.56, and the accuracy of ER improves by 5%.
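To make the idea concrete, here is a minimal PyTorch sketch of inserting lightweight CNN adapters into a frozen convolutional feature extractor. The adapter design below (a residual 1-D convolutional bottleneck added after each frozen conv block) and the names `CNNAdapter` and `AdaptedFeatureExtractor` are hypothetical illustrations of the general technique, not the paper's exact CHAPTER architecture.

```python
# Sketch: residual CNN adapters on a frozen conv feature extractor.
# Assumes a HuBERT-style backbone whose feature extractor is a stack of
# 1-D conv blocks; only the adapters receive gradients.
import torch
import torch.nn as nn


class CNNAdapter(nn.Module):
    """Lightweight residual conv adapter: project down, convolve, project up."""

    def __init__(self, channels: int, bottleneck: int = 32):
        super().__init__()
        self.down = nn.Conv1d(channels, bottleneck, kernel_size=1)
        self.conv = nn.Conv1d(bottleneck, bottleneck, kernel_size=3, padding=1)
        self.up = nn.Conv1d(bottleneck, channels, kernel_size=1)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time); the residual path keeps the frozen
        # pre-trained features intact while the adapter learns a correction.
        return x + self.up(self.act(self.conv(self.act(self.down(x)))))


class AdaptedFeatureExtractor(nn.Module):
    """Wraps a frozen conv feature extractor, adding one adapter per block."""

    def __init__(self, conv_blocks: nn.ModuleList, channel_dims: list[int]):
        super().__init__()
        self.blocks = conv_blocks
        for p in self.blocks.parameters():
            p.requires_grad = False  # backbone stays frozen
        self.adapters = nn.ModuleList(CNNAdapter(c) for c in channel_dims)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block, adapter in zip(self.blocks, self.adapters):
            x = adapter(block(x))
        return x
```

Because the bottleneck dimension is small relative to the channel width, the trainable parameters (countable via `sum(p.numel() for p in model.parameters() if p.requires_grad)`) remain a small fraction of the backbone, consistent with the under-5%-per-task budget described above.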