Self-supervised pre-training of a speech foundation model, followed by supervised fine-tuning, has shown impressive quality improvements on automatic speech recognition (ASR) tasks. Fine-tuning a separate foundation model for each of many downstream tasks is expensive, since the foundation model is usually very large. Parameter-efficient fine-tuning methods (e.g., adapters, sparse update methods) offer an alternative paradigm in which a small set of parameters is updated to adapt the foundation model to new tasks. However, these methods still suffer from high computational memory cost and slow training speed because they require backpropagation through the entire neural network at each step. In this paper, we analyze the performance of features from different layers of a foundation model on the speech recognition task and propose a novel hierarchical feature fusion method for resource-efficient transfer learning from speech foundation models. Experimental results show that the proposed method achieves better performance on speech recognition than existing algorithms with fewer trainable parameters, lower computational memory cost, and faster training speed. When combined with Adapters at all layers, the proposed method matches the performance of fine-tuning the whole model with $97\%$ fewer trainable encoder parameters and $53\%$ faster training speed.
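To make the general idea concrete, the following is a minimal sketch (not the paper's exact architecture) of fusing hidden states from several frozen encoder layers with learnable weights and training only a small head on top. The class name `HierarchicalFeatureFusion`, the hypothetical `encoder` in the usage comment, and the choice of softmax layer weights are illustrative assumptions; the point is that gradients never flow through the frozen encoder, which is what keeps memory and training cost low.

```python
import torch
import torch.nn as nn


class HierarchicalFeatureFusion(nn.Module):
    """Illustrative sketch: combine detached hidden states from selected
    frozen encoder layers via learnable softmax weights, then apply a
    small trainable output head. Backpropagation touches only the fusion
    weights and the head, not the foundation model."""

    def __init__(self, num_layers: int, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))  # trainable
        self.head = nn.Linear(hidden_dim, vocab_size)               # trainable

    def forward(self, layer_outputs: list) -> torch.Tensor:
        # layer_outputs: list of (batch, time, hidden_dim) tensors,
        # one per selected encoder layer, already detached from the encoder.
        stacked = torch.stack(layer_outputs, dim=0)                # (L, B, T, H)
        weights = torch.softmax(self.layer_weights, dim=0)         # (L,)
        fused = (weights.view(-1, 1, 1, 1) * stacked).sum(dim=0)   # (B, T, H)
        return self.head(fused)                                    # frame-level logits


# Usage sketch with a frozen foundation encoder (hypothetical `encoder`):
# with torch.no_grad():
#     layer_outputs = [h.detach() for h in encoder(audio, output_hidden_states=True)]
# logits = fusion(layer_outputs)
# loss = ctc_loss(logits, targets, ...)  # gradients reach only fusion + head
```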