Self-supervised learning (SSL) is a powerful tool for learning underlying representations from unlabeled data. Transformer-based models such as wav2vec 2.0 and HuBERT are leading the field in the speech domain. Generally, these models are fine-tuned on a small amount of labeled data for a downstream task such as Automatic Speech Recognition (ASR), which involves re-training the majority of the model for each task. Adapters are small, lightweight modules commonly used in Natural Language Processing (NLP) to adapt pre-trained models to new tasks. In this paper we propose applying adapters to wav2vec 2.0 to reduce the number of parameters required for downstream ASR tasks and to increase the scalability of the model to multiple tasks or languages. Using adapters, we can perform ASR while training fewer than 10% of the parameters per task compared to full fine-tuning, with little degradation in performance. Ablations show that inserting adapters into only the top few layers of the pre-trained network gives performance similar to full transfer, supporting the theory that higher pre-trained layers encode more phonemic information, and further improving efficiency.
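As a rough illustration of the adapter idea described above (not necessarily the exact architecture used in this work), the sketch below shows a standard bottleneck adapter: a down-projection, nonlinearity, and up-projection wrapped in a residual connection, of the kind typically inserted into the transformer layers of a frozen pre-trained encoder. All names and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Minimal sketch of a residual bottleneck adapter.

    Hypothetical example: hidden_dim and bottleneck_dim are illustrative,
    not the configuration reported in the paper.
    """

    def __init__(self, hidden_dim: int = 768, bottleneck_dim: int = 64):
        super().__init__()
        self.layer_norm = nn.LayerNorm(hidden_dim)
        self.down = nn.Linear(hidden_dim, bottleneck_dim)  # down-projection
        self.up = nn.Linear(bottleneck_dim, hidden_dim)    # up-projection
        self.activation = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The residual connection preserves the pre-trained representation;
        # only the small projection layers are trained for the new task.
        return x + self.up(self.activation(self.down(self.layer_norm(x))))


# Only the adapter (and a task head) would be trained; the pre-trained
# wav2vec 2.0 encoder stays frozen, keeping per-task parameters small.
adapter = BottleneckAdapter()
features = torch.randn(2, 100, 768)  # (batch, frames, hidden_dim)
print(adapter(features).shape)       # torch.Size([2, 100, 768])
```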