Self-supervised learning (SSL) has recently shown remarkable results in closing the gap between supervised and unsupervised learning. The idea is to learn robust features that are invariant to distortions of the input data. Despite its success, this approach can suffer from a collapse issue in which the network produces a constant representation. To address this, we introduce SELFIE, a novel Self-supervised Learning approach for audio representation via Feature Diversity and Decorrelation. SELFIE avoids collapse by ensuring that the representation (i) maintains high diversity among embeddings and (ii) decorrelates dependencies between embedding dimensions. SELFIE is pre-trained on the large-scale AudioSet dataset, and its embeddings are validated on nine audio downstream tasks covering speech, music, and sound event recognition. Experimental results show that SELFIE outperforms existing SSL methods on several of these tasks.
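The abstract does not give SELFIE's exact training objective, but the two properties it names, diversity among embeddings and decorrelation across dimensions, map naturally onto a regularizer with a per-dimension variance term and an off-diagonal covariance penalty, in the spirit of VICReg-style objectives. The sketch below is a minimal, hypothetical illustration of such a loss in PyTorch; the function name, the hinge threshold of 1, and the equal weighting of the two terms are assumptions, not the paper's definition.

```python
import torch
import torch.nn.functional as F

def diversity_decorrelation_loss(z, eps=1e-4):
    """Hypothetical regularizer illustrating the two properties described
    in the abstract: (i) high diversity among embeddings and
    (ii) decorrelated embedding dimensions. Not SELFIE's actual loss.

    z: (batch, dim) tensor of embeddings from the audio encoder.
    """
    # Center the embeddings across the batch.
    z = z - z.mean(dim=0)

    # (i) Diversity: keep each dimension's std over the batch above 1,
    # discouraging a constant (collapsed) representation.
    std = torch.sqrt(z.var(dim=0) + eps)
    diversity_loss = F.relu(1.0 - std).mean()

    # (ii) Decorrelation: push off-diagonal entries of the covariance
    # matrix toward zero so dimensions carry non-redundant information.
    n, d = z.shape
    cov = (z.T @ z) / (n - 1)
    off_diag = cov - torch.diag(torch.diag(cov))
    decorrelation_loss = (off_diag ** 2).sum() / d

    return diversity_loss + decorrelation_loss
```

In use, such a term would be added to the SSL objective computed on embeddings of distorted views of the same audio clip, so that invariance is learned without the representation collapsing.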