Self-supervised learning (SSL) speech models have achieved unprecedented success in speech representation learning, but some questions about their representation ability remain unanswered. This paper addresses two of them: (1) Can SSL speech models handle non-speech audio? (2) Do different SSL speech models capture different aspects of audio features? To answer these questions, we conduct extensive experiments on a wide range of speech and non-speech audio datasets to evaluate the representation ability of two currently state-of-the-art SSL speech models, wav2vec 2.0 and HuBERT. The experiments were carried out during the NeurIPS 2021 HEAR Challenge using the standard evaluation pipeline provided by the challenge organizers. The results show that (1) SSL speech models can extract meaningful features from a wide range of non-speech audio, though they may fail on certain types of datasets; (2) different SSL speech models capture different aspects of audio features. These two conclusions provide a foundation for ensembling representation models. We further propose an ensemble framework that fuses the embeddings of multiple speech representation models. Our framework outperforms state-of-the-art SSL speech/audio models and achieves generally superior performance across the HEAR Challenge datasets compared with other participating teams. Our code is available at https://github.com/tony10101105/HEAR-2021-NeurIPS-Challenge---NTU-GURA.
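The abstract mentions an ensemble framework that fuses the embeddings of multiple speech representation models. A minimal sketch of one common fusion strategy, frame-wise concatenation of the feature dimensions, is shown below; the paper does not specify its exact fusion method here, so the function name, embedding shapes, and random placeholder arrays (standing in for wav2vec 2.0 / HuBERT outputs) are illustrative assumptions.

```python
import numpy as np

def fuse_embeddings(emb_a: np.ndarray, emb_b: np.ndarray) -> np.ndarray:
    """Fuse frame-level embeddings from two representation models by
    concatenating along the feature dimension.

    Both inputs are (num_frames, dim) arrays; frame counts must match,
    which in practice may require resampling one model's frame rate.
    """
    assert emb_a.shape[0] == emb_b.shape[0], "frame counts must match"
    return np.concatenate([emb_a, emb_b], axis=1)

# Hypothetical embeddings: 50 frames, 768 dims each (typical base-model size)
wav2vec_emb = np.random.randn(50, 768)
hubert_emb = np.random.randn(50, 768)

fused = fuse_embeddings(wav2vec_emb, hubert_emb)
print(fused.shape)  # (50, 1536)
```

The fused representation can then be passed to a shallow downstream classifier, which is how the HEAR Challenge evaluation pipeline scores embeddings.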