In this paper, we provide a new perspective on self-supervised speech models based on how the self-training targets are obtained. We generalize the target extractor into an Offline Targets Extractor (Off-TE) and an Online Targets Extractor (On-TE). Based on this, we propose MT4SSL, a new multi-task learning framework for self-supervised learning, which stands for Boosting Self-Supervised Speech Representation Learning by Integrating Multiple Targets. MT4SSL draws on two typical models, HuBERT and data2vec, which use the K-means algorithm as an Off-TE and a teacher network without gradients as an On-TE, respectively. Our model outperforms previous SSL methods by nontrivial margins on the LibriSpeech benchmark, and is comparable to or even better than the best-performing models while using less data. Furthermore, we find that using both an Off-TE and an On-TE leads to better convergence during pre-training. Given both its effectiveness and efficiency, we believe that doing multi-task learning on self-supervised speech models from our perspective is a promising direction.
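To make the Off-TE/On-TE distinction concrete, the following is a minimal sketch (not the authors' released implementation) of a multi-task pre-training step that combines an offline target extractor (precomputed K-means cluster IDs, as in HuBERT) with an online target extractor (a frozen EMA teacher network, as in data2vec). All module names, dimensions, and the loss weighting are illustrative assumptions; input masking of the student is also simplified for brevity.

```python
# Hypothetical sketch of combining Off-TE and On-TE targets in one loss.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class StudentEncoder(nn.Module):
    """Tiny stand-in for the Transformer encoder used during pre-training."""
    def __init__(self, feat_dim=80, hidden_dim=256, num_clusters=500):
        super().__init__()
        self.proj_in = nn.Linear(feat_dim, hidden_dim)
        layer = nn.TransformerEncoderLayer(hidden_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Head for the offline (discrete) targets: predicts K-means cluster IDs.
        self.cluster_head = nn.Linear(hidden_dim, num_clusters)
        # Head for the online (continuous) targets: regresses teacher features.
        self.regress_head = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, x):
        h = self.encoder(self.proj_in(x))
        return h, self.cluster_head(h), self.regress_head(h)

def multi_task_loss(student, teacher, feats, mask, kmeans_ids, alpha=0.5):
    """Off-TE loss: cross-entropy on masked frames against precomputed K-means IDs.
    On-TE loss: regression toward the hidden states of the gradient-free EMA teacher.
    (In practice the student sees masked input; that detail is omitted here.)"""
    _, cluster_logits, regressed = student(feats)
    with torch.no_grad():                      # teacher receives no gradients
        teacher_h, _, _ = teacher(feats)
    ce = F.cross_entropy(cluster_logits[mask], kmeans_ids[mask])
    reg = F.mse_loss(regressed[mask], teacher_h[mask])
    return alpha * ce + (1.0 - alpha) * reg

# Toy usage: batch of 2 utterances, 50 frames of 80-dim features.
student = StudentEncoder()
teacher = copy.deepcopy(student)               # EMA teacher, updated outside this step
for p in teacher.parameters():
    p.requires_grad_(False)

feats = torch.randn(2, 50, 80)
mask = torch.rand(2, 50) < 0.3                 # ~30% of frames serve as targets
kmeans_ids = torch.randint(0, 500, (2, 50))    # offline targets from a fitted K-means
loss = multi_task_loss(student, teacher, feats, mask, kmeans_ids)
loss.backward()
```

In this reading, multi-task learning simply means the two target streams share one encoder and are optimized jointly; the weighting between the discrete and continuous objectives is a free hyperparameter in the sketch.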