Self-supervised learning representations (SSLRs) have yielded robust features for downstream tasks in many fields. Recently, several SSLRs have shown promising results on automatic speech recognition (ASR) benchmark corpora. However, previous studies have only reported performance for a single SSLR used as the input feature to an ASR model. In this study, we investigate the effectiveness of diverse SSLR combinations using various fusion methods within end-to-end (E2E) ASR models. In addition, we show that the extracted SSLRs are correlated with one another. We therefore further propose a feature refinement loss for decorrelation to combine the set of input features efficiently. For evaluation, we show that the proposed 'FeaRLESS learning features' outperform systems trained without the proposed feature refinement loss on both the WSJ and Fearless Steps Challenge (FSC) corpora.
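The abstract does not give the exact form of the feature refinement loss, but a decorrelation objective of this kind can be sketched as a penalty on the cross-correlation between two SSLR feature streams. The following is a minimal illustrative sketch, not the paper's actual loss; the function name and formulation are assumptions for illustration.

```python
import numpy as np

def decorrelation_loss(feat_a, feat_b, eps=1e-8):
    """Hypothetical sketch of a decorrelation-style feature refinement loss.

    Penalizes cross-correlation between two SSLR feature streams so that the
    fused representation carries less redundant information.

    feat_a, feat_b: (num_frames, dim) feature matrices from two SSLRs.
    Returns the mean squared entry of their cross-correlation matrix,
    which approaches 0 when the streams are fully decorrelated.
    """
    # Standardize each feature dimension over the time frames.
    a = (feat_a - feat_a.mean(axis=0)) / (feat_a.std(axis=0) + eps)
    b = (feat_b - feat_b.mean(axis=0)) / (feat_b.std(axis=0) + eps)
    # Cross-correlation matrix between the two standardized feature sets.
    c = a.T @ b / feat_a.shape[0]
    # Mean squared correlation as a scalar penalty.
    return float(np.mean(c ** 2))

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 8))          # stand-in for one SSLR's features
y = rng.normal(size=(200, 8))          # stand-in for a second SSLR's features
redundant = decorrelation_loss(x, x)   # identical streams: high penalty
independent = decorrelation_loss(x, y) # unrelated streams: low penalty
```

In a fusion setting, such a term would be added to the E2E ASR training objective, encouraging the combined SSLRs to contribute complementary rather than overlapping information.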