DeLoRes: Decorrelating Latent Spaces for Low-Resource Audio Representation Learning

Inspired by recent progress in self-supervised learning for computer vision, in this paper we introduce DeLoRes, a new general-purpose audio representation learning approach. Our main objective is to make our network learn representations in a resource-constrained setting (both data and compute) that generalize well across a diverse set of downstream tasks. Drawing on the Barlow Twins objective function, we propose to learn embeddings that are invariant to distortions of an input audio sample, while making sure that they contain non-redundant information about the sample. To achieve this, we measure the cross-correlation matrix between the outputs of two identical networks fed with distorted versions of an audio segment sampled from an audio file, and make it as close to the identity matrix as possible. We use a combination of a small subset of the large-scale AudioSet dataset and FSD50K for self-supervised learning, and are able to learn with fewer than half the parameters of state-of-the-art algorithms. For evaluation, we transfer these learned representations to 9 downstream classification tasks, including speech, music, and animal sounds, and show competitive results under different evaluation setups. In addition to being simple and intuitive, our pre-training algorithm is compute-friendly by construction and does not require careful implementation details to avoid trivial or degenerate solutions. Furthermore, we conduct ablation studies on our results and make all our code and pre-trained models publicly available at https://github.com/Speech-Lab-IITM/DeLoRes.
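
The cross-correlation objective described above can be illustrated with a minimal sketch. It is not the authors' implementation: it follows the general form of the Barlow Twins loss, is written in plain Python so that it runs without dependencies, and the function names and the `off_diag_weight` default are assumptions made for illustration. Here `z_a` and `z_b` stand for the embeddings that the two identical networks produce for two distorted versions of the same batch of audio segments.

```python
import math


def standardize_columns(batch):
    """Normalize each embedding dimension to zero mean and unit variance
    across the batch. Returns one list per dimension (column-major)."""
    n = len(batch)
    dim = len(batch[0])
    columns = []
    for j in range(dim):
        col = [row[j] for row in batch]
        mean = sum(col) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
        if std == 0:
            std = 1.0  # guard against constant dimensions
        columns.append([(v - mean) / std for v in col])
    return columns


def cross_correlation(z_a, z_b):
    """Cross-correlation matrix C between two batches of embeddings.
    C[i][j] correlates dimension i of z_a with dimension j of z_b."""
    cols_a = standardize_columns(z_a)
    cols_b = standardize_columns(z_b)
    n = len(z_a)
    return [
        [sum(a * b for a, b in zip(col_a, col_b)) / n for col_b in cols_b]
        for col_a in cols_a
    ]


def decorrelation_loss(z_a, z_b, off_diag_weight=5e-3):
    """Penalize the distance between C and the identity matrix.
    off_diag_weight is a hypothetical default, not a value from the paper."""
    c = cross_correlation(z_a, z_b)
    d = len(c)
    # Diagonal terms pull C[i][i] toward 1 (invariance to distortions).
    invariance = sum((1.0 - c[i][i]) ** 2 for i in range(d))
    # Off-diagonal terms pull C[i][j] toward 0 (non-redundant dimensions).
    redundancy = sum(c[i][j] ** 2 for i in range(d) for j in range(d) if i != j)
    return invariance + off_diag_weight * redundancy


# Toy embeddings (batch of 4, dimension 2), made up for illustration.
decorrelated = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
redundant = [[1.0, 1.0], [0.0, 0.0], [-1.0, -1.0], [2.0, 2.0]]

loss_decorrelated = decorrelation_loss(decorrelated, decorrelated)
loss_redundant = decorrelation_loss(redundant, redundant)
```

In this toy example, the first batch has uncorrelated dimensions and identical views, so its cross-correlation matrix is the identity and the loss is approximately zero. The second batch duplicates one dimension, so the off-diagonal entries are close to 1 and the loss is strictly positive even though the two views agree perfectly. This is the sense in which the objective rewards embeddings that are both invariant and non-redundant.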
