In this paper, we present a framework for contrastive learning of audio representations in a self-supervised setup, without access to any ground truth labels. The core idea of self-supervised contrastive learning is to map an audio signal and its various augmented versions (representative of salient aspects of audio such as pitch and timbre) to a space where they are close together and separated from other, different signals. In addition, we explore generative models based on state-of-the-art transformer architectures for learning latent spaces of audio signals, without access to any labels. Here, we map audio signals at a smaller scale to discrete dictionary elements and train transformers to predict the next dictionary element. We use only the data itself as supervision, bypassing the need for labels to supervise the training of deep neural networks. We then use a linear classifier head to evaluate the performance of our models, for both the self-supervised contrastive and the generative transformer-based representations that are learned. Our system achieves considerable performance compared to a fully supervised method that has access to ground truth labels for training the neural network model. Given the availability of large-scale audio data, these representations show promise for a variety of audio understanding tasks.
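To make the contrastive objective concrete, the following is a minimal PyTorch sketch of an NT-Xent (normalized temperature-scaled cross-entropy) loss of the kind used in SimCLR-style contrastive learning, which pulls two augmented views of the same clip together while pushing other clips in the batch apart. The function name and temperature value are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z_i, z_j, temperature=0.1):
    """NT-Xent contrastive loss over a batch of paired embeddings.

    z_i, z_j: (batch, dim) embeddings of two augmented views of the
    same audio clips. Positive pairs are (z_i[k], z_j[k]); every other
    embedding in the concatenated batch acts as a negative.
    """
    batch = z_i.shape[0]
    z = F.normalize(torch.cat([z_i, z_j], dim=0), dim=1)  # (2B, dim)
    sim = z @ z.T / temperature                           # cosine similarities
    sim.fill_diagonal_(float('-inf'))                     # mask self-similarity
    # Each embedding's positive sits batch positions away: k <-> k + batch.
    targets = torch.cat([torch.arange(batch) + batch,
                         torch.arange(batch)])
    return F.cross_entropy(sim, targets)
```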
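The generative path, predicting the next discrete dictionary element, can likewise be sketched as a small causal transformer language model over codebook indices. This is a hypothetical sketch: the CodePredictor class, its layer sizes, and the 512-entry codebook are assumptions for illustration, not the architecture reported in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CodePredictor(nn.Module):
    """Causal transformer over discrete dictionary elements (e.g.
    vector-quantized codebook indices). Sizes are illustrative."""
    def __init__(self, vocab_size=512, d_model=256, n_layers=4,
                 n_heads=4, max_len=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, codes):                       # codes: (batch, seq) int64
        seq = codes.shape[1]
        pos = torch.arange(seq, device=codes.device)
        # Upper-triangular -inf mask keeps attention strictly causal.
        mask = torch.triu(torch.full((seq, seq), float('-inf'),
                                     device=codes.device), diagonal=1)
        h = self.encoder(self.embed(codes) + self.pos(pos), mask=mask)
        return self.head(h)                         # logits over the next code

# Next-element prediction: targets are the inputs shifted by one step.
model = CodePredictor()
codes = torch.randint(0, 512, (8, 128))              # toy batch of code sequences
logits = model(codes[:, :-1])
loss = F.cross_entropy(logits.reshape(-1, 512), codes[:, 1:].reshape(-1))
```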
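The linear classifier evaluation amounts to fitting a linear model on embeddings from the frozen pretrained encoder. A minimal sketch using scikit-learn, with random stand-in features where real usage would substitute embeddings extracted by the contrastive or generative model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Stand-ins for frozen-encoder embeddings and downstream task labels.
train_feats = rng.normal(size=(200, 256))
train_labels = rng.integers(0, 10, size=200)
test_feats = rng.normal(size=(50, 256))
test_labels = rng.integers(0, 10, size=50)

probe = LogisticRegression(max_iter=1000)   # the linear classifier head
probe.fit(train_feats, train_labels)
print("linear-probe accuracy:", probe.score(test_feats, test_labels))
```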