Self-supervised learning (SSL) methods for speech representation, such as wav2vec 2.0 and Hidden-unit BERT (HuBERT), leverage unlabeled speech data for pre-training and offer good representations for numerous speech processing tasks. Despite their success, these methods require large memory and high pre-training costs, making them inaccessible to researchers in academia and small companies. Therefore, this paper introduces DistilHuBERT, a novel multi-task learning framework that distills hidden representations directly from a HuBERT model. This method reduces HuBERT's size by 75% and makes it 73% faster, while retaining most of its performance on ten different tasks. Moreover, DistilHuBERT requires little training time and data, opening up the possibility of pre-training personal and on-device SSL speech models.
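To make the multi-task distillation idea concrete, the following is a minimal PyTorch sketch, not the authors' exact recipe: a small student encoder is trained to predict the hidden states of several teacher (HuBERT) layers through separate prediction heads. The layer choices, loss terms, and architecture sizes are illustrative assumptions; a full implementation would reuse the teacher's convolutional feature extractor and the paper's training configuration.

```python
# Hedged sketch of multi-task hidden-representation distillation.
# All hyperparameters below (distilled layers, head sizes, loss weights)
# are assumptions for illustration, not values from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskDistiller(nn.Module):
    def __init__(self, feat_dim=768, num_student_layers=2, teacher_layers=(4, 8, 12)):
        super().__init__()
        self.teacher_layers = teacher_layers
        # Shared student encoder (stands in for the CNN front end plus a few
        # Transformer layers of the full model).
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=8, batch_first=True)
        self.student = nn.TransformerEncoder(encoder_layer, num_student_layers)
        # One prediction head per distilled teacher layer (the "multi-task" part).
        self.heads = nn.ModuleList(
            [nn.Linear(feat_dim, feat_dim) for _ in teacher_layers])

    def forward(self, features):
        shared = self.student(features)            # (B, T, D) shared representation
        return [head(shared) for head in self.heads]

def distill_loss(student_preds, teacher_hidden, lam=1.0):
    """L1 reconstruction plus a cosine-similarity term per predicted layer."""
    loss = 0.0
    for pred, target in zip(student_preds, teacher_hidden):
        loss = loss + F.l1_loss(pred, target)
        loss = loss - lam * F.cosine_similarity(pred, target, dim=-1).mean()
    return loss

# Usage with dummy tensors standing in for frame features and the frozen
# teacher's per-layer hidden states.
B, T, D = 2, 100, 768
frames = torch.randn(B, T, D)                               # student input features
teacher_states = [torch.randn(B, T, D) for _ in range(3)]   # frozen teacher outputs
model = MultiTaskDistiller()
loss = distill_loss(model(frames), teacher_states)
loss.backward()
```

After distillation, the prediction heads can be discarded and the shared student encoder used as a compact feature extractor for downstream speech tasks.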