Self-supervised learning (SSL) has achieved great success in various areas, including speech processing. Recently, speech-based SSL models have been shown to extract superior universal representations across a range of downstream tasks compared to traditional hand-crafted features (e.g., FBank, MFCC) on the SUPERB benchmark. However, different types of SSL models may exhibit distinct strengths on different downstream tasks. To better exploit the potential of SSL models, in this work we explore the effective fusion of multiple SSL models. A series of model fusion algorithms are investigated and compared by combining two types of SSL models, HuBERT and data2vec, on two representative tasks from the SUPERB benchmark: speaker identification (SID) and automatic speech recognition (ASR). The experimental results demonstrate that our proposed fusion algorithms significantly outperform the individual models.
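The abstract does not specify the fusion algorithms themselves. As a rough illustration only, the sketch below shows one simple feature-level fusion scheme: a learnable weighted sum of projected frame-level features from two upstream SSL models. The class name `SimpleFusion`, the feature dimensions, and the weighting scheme are illustrative assumptions and are not claimed to be the method proposed in the paper.

```python
import torch
import torch.nn as nn

class SimpleFusion(nn.Module):
    """Fuse frame-level representations from two upstream SSL models
    with learnable per-model weights before a downstream head.
    (Hypothetical sketch, not the paper's actual fusion algorithm.)"""

    def __init__(self, dim_a: int, dim_b: int, fused_dim: int):
        super().__init__()
        # Project each model's features into a shared dimension.
        self.proj_a = nn.Linear(dim_a, fused_dim)
        self.proj_b = nn.Linear(dim_b, fused_dim)
        # Learnable scalar weights, normalized with softmax at fusion time.
        self.weights = nn.Parameter(torch.zeros(2))

    def forward(self, feats_a: torch.Tensor, feats_b: torch.Tensor) -> torch.Tensor:
        # feats_a: (batch, time, dim_a), feats_b: (batch, time, dim_b)
        w = torch.softmax(self.weights, dim=0)
        return w[0] * self.proj_a(feats_a) + w[1] * self.proj_b(feats_b)

# Example: fuse placeholder HuBERT-style and data2vec-style frame features
# (768-d is assumed here; real upstream outputs would replace the random tensors).
fusion = SimpleFusion(dim_a=768, dim_b=768, fused_dim=256)
hubert_feats = torch.randn(4, 100, 768)
data2vec_feats = torch.randn(4, 100, 768)
fused = fusion(hubert_feats, data2vec_feats)
print(fused.shape)  # torch.Size([4, 100, 256])
```

The fused representation would then feed a lightweight downstream head (e.g., a speaker classifier for SID or a CTC decoder for ASR), in line with the SUPERB-style setup the abstract describes.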