We present a new Self-Supervised Learning (SSL) approach for pre-training encoders on unlabeled audio data that reduces the need for large amounts of labeled data in audio and speech classification. Our primary aim is to learn audio representations that generalize across a wide variety of speech and non-speech tasks in a low-resource unlabeled audio pre-training setting. Inspired by the recent success of clustering and contrastive learning paradigms for SSL-based speech representation learning, we propose SLICER (Symmetrical Learning of Instance and Cluster-level Efficient Representations), which brings together the best of both paradigms. We apply a symmetric loss between latent representations from student and teacher encoders and simultaneously solve instance- and cluster-level contrastive learning tasks. We obtain cluster representations online by simply projecting the input spectrogram into an output subspace with dimensionality equal to the number of clusters. In addition, we propose a novel mel-spectrogram augmentation procedure, k-mix, based on mixup, which does not require labels and aids unsupervised representation learning for audio. Overall, SLICER achieves state-of-the-art results on the LAPE Benchmark \cite{9868132}, significantly outperforming DeLoRes-M and other prior approaches that are pre-trained on $10\times$ more unsupervised data. We will make all our code available on GitHub.
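The abstract notes that k-mix builds on mixup but drops the label-mixing step, so it can run on unlabeled audio. The exact k-mix procedure is defined in the paper body; the snippet below is only a minimal sketch of the underlying idea of label-free mixup on a batch of mel-spectrograms, with the function name `label_free_mixup` and the Beta-distribution parameter `alpha` chosen for illustration rather than taken from the paper.

```python
import numpy as np

def label_free_mixup(batch, alpha=0.5, rng=None):
    """Mix each mel-spectrogram in `batch` with a randomly chosen
    partner from the same batch. No labels are involved, so the
    mixed batch can feed a self-supervised pre-training objective.

    batch: array of shape (B, n_mels, n_frames)
    alpha: Beta-distribution parameter for the mixing coefficient
    """
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)        # mixing coefficient in (0, 1)
    perm = rng.permutation(len(batch))  # random partner for each sample
    return lam * batch + (1.0 - lam) * batch[perm]

# Example: a batch of 8 mel-spectrograms (64 mel bins x 96 frames)
batch = np.random.default_rng(0).random((8, 64, 96)).astype(np.float32)
mixed = label_free_mixup(batch)
```

Because the result is a convex combination of two spectrograms from the batch, the augmented views stay on the data manifold while still differing from the originals, which is what makes this family of augmentations useful for contrastive pre-training.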