Fine-tuning self-supervised models is a powerful transfer learning method in many fields, including speech processing, because it can exploit the generic feature representations learned from large amounts of unlabeled data. Fine-tuning, however, requires a new set of parameters for each downstream task, which is parameter-inefficient. Adapter architectures partially address this issue by inserting lightweight learnable modules into a frozen pre-trained model. However, existing adapter architectures fail to adaptively leverage the low- to high-level features stored in different layers, which is necessary for solving diverse speech processing tasks. We therefore propose a new adapter architecture that acquires feature representations more flexibly across speech tasks. In experiments, we applied the proposed adapter to WavLM on four speech tasks. It performed on par with or better than naive fine-tuning while updating only 11% of the learnable parameters, and it also outperformed an existing adapter architecture.
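The abstract does not give implementation details, so the following is only a minimal sketch of the two ideas it mentions: a lightweight bottleneck adapter inserted alongside a frozen backbone, and a learnable weighting over the hidden states of different layers so that low- to high-level features can be mixed per task. All class names, dimensions, and the mean-pooling head are hypothetical assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Lightweight adapter: down-project, non-linearity, up-project,
    added residually to a frozen layer's output (hypothetical sketch)."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))


class LayerWeightedHead(nn.Module):
    """Aggregates per-layer hidden states of a frozen encoder with learnable
    softmax weights, so each task can emphasize low- or high-level features."""

    def __init__(self, num_layers: int, dim: int, num_classes: int):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, hidden_states: list[torch.Tensor]) -> torch.Tensor:
        # hidden_states: one (batch, time, dim) tensor per encoder layer
        stacked = torch.stack(hidden_states, dim=0)           # (L, B, T, D)
        weights = torch.softmax(self.layer_weights, dim=0)    # (L,)
        mixed = (weights.view(-1, 1, 1, 1) * stacked).sum(0)  # (B, T, D)
        pooled = mixed.mean(dim=1)                            # temporal mean-pool
        return self.classifier(pooled)


# Usage sketch: freeze the pre-trained backbone (e.g. a WavLM encoder exposing
# per-layer hidden states) and train only the adapters and the head.
# for p in backbone.parameters():
#     p.requires_grad = False
```

In this kind of setup, only the adapter and head parameters are updated, which is how the parameter count stays a small fraction of full fine-tuning; the exact 11% figure reported in the abstract depends on the paper's specific design.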