Through solving pretext tasks, self-supervised learning leverages unlabeled data to extract useful latent representations that replace traditional input features in the downstream task. In audio/speech signal processing, a wide range of features were engineered through decades of research efforts. As it turns out, learning to predict such features (a.k.a. pseudo-labels) has proven to be a particularly relevant pretext task, leading to self-supervised representations that are effective for downstream tasks. However, methods and common practices for combining such pretext tasks for better performance on the downstream task have not been explored and understood properly. In fact, the process relies almost exclusively on a computationally heavy experimental procedure, which becomes intractable as the number of pretext tasks increases. This paper introduces a method to select a group of pretext tasks among a set of candidates. The method we propose estimates calibrated weights for the partial losses corresponding to the considered pretext tasks during the self-supervised training process. Experiments conducted on automatic speech recognition, speaker recognition, and emotion recognition validate our approach, as the groups selected and weighted with our method outperform classic baselines, thus facilitating the selection and combination of relevant pseudo-labels for self-supervised representation learning.
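To make the weighted combination of pretext-task losses concrete, the sketch below shows a generic multi-pretext-task setup in PyTorch: a shared encoder feeds one regression head per pseudo-label, and the total training loss is a weighted sum of the per-task losses. This is a minimal illustration, not the paper's implementation; the encoder, heads, dimensions, and the fixed `task_weights` are hypothetical placeholders, whereas the proposed method estimates calibrated weights for these partial losses during training.

```python
# Minimal sketch (assumptions, not the paper's code): several pretext tasks
# predicting pseudo-labels from a shared encoder, combined via weighted losses.
import torch
import torch.nn as nn

class MultiPretextModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, pseudo_label_dims):
        super().__init__()
        # Shared encoder producing the latent representation reused downstream.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        # One regression head per pretext task (one per pseudo-label).
        self.heads = nn.ModuleList(
            nn.Linear(hidden_dim, d) for d in pseudo_label_dims
        )

    def forward(self, x):
        z = self.encoder(x)
        return [head(z) for head in self.heads]

# Hypothetical example: three pretext tasks, each predicting a 1-dimensional
# pseudo-label (e.g. pitch, energy, voicing) from 80-dimensional features.
model = MultiPretextModel(input_dim=80, hidden_dim=256, pseudo_label_dims=[1, 1, 1])
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

# Hand-picked weights for illustration only; the proposed method would
# estimate calibrated weights instead of fixing them like this.
task_weights = [0.5, 0.3, 0.2]

x = torch.randn(32, 80)                                   # batch of input features
pseudo_labels = [torch.randn(32, 1) for _ in range(3)]    # one target per task

predictions = model(x)
partial_losses = [criterion(p, t) for p, t in zip(predictions, pseudo_labels)]
total_loss = sum(w * l for w, l in zip(task_weights, partial_losses))

optimizer.zero_grad()
total_loss.backward()
optimizer.step()
```

Under this framing, selecting a group of pretext tasks amounts to choosing which heads to keep and how much weight each partial loss receives, which is exactly the decision the proposed method aims to automate instead of relying on exhaustive experimental search.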