Self-supervised speech representation learning has recently become a thriving research topic. Many algorithms have been proposed for learning useful representations from large-scale unlabeled data, and their applications to a wide range of speech tasks have also been investigated. However, there has been little research focusing on understanding the properties of existing approaches. In this work, we aim to provide a comparative study of some of the most representative self-supervised algorithms. Specifically, we quantify the similarities between different self-supervised representations using existing similarity measures. We also design probing tasks to study the correlation between the models' pre-training loss and the amount of specific speech information contained in their learned representations. In addition to showing how various self-supervised models behave differently given the same input, our study also finds that the training objective has a greater impact on representation similarity than architectural choices such as building blocks (RNN/Transformer/CNN) and directionality (uni/bidirectional). Our results also suggest that there exists a strong correlation between pre-training loss and downstream performance for some self-supervised algorithms.
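As an illustration of the kind of representation-similarity analysis described above, the sketch below computes linear Centered Kernel Alignment (CKA), one commonly used similarity measure, between the frame-level features of two models. The abstract does not specify which measures are used, so the choice of linear CKA and the names `linear_cka`, `feats_a`, and `feats_b` are assumptions for illustration only.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two representation matrices (a hypothetical helper).

    X: (n_frames, d1) features from one self-supervised model
    Y: (n_frames, d2) features from another model, time-aligned with X
    Returns a similarity score in [0, 1]; higher means more similar
    up to rotation and isotropic scaling.
    """
    # Center each feature dimension.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)

    # CKA(X, Y) = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return cross / (norm_x * norm_y)

# Example: compare frame-level representations of the same utterance
# extracted by two hypothetical models at the same frame rate.
rng = np.random.default_rng(0)
feats_a = rng.standard_normal((500, 768))   # e.g., a Transformer-based model
feats_b = rng.standard_normal((500, 512))   # e.g., a CNN- or RNN-based model
print(f"linear CKA: {linear_cka(feats_a, feats_b):.3f}")
```

Because CKA is invariant to orthogonal transformations and isotropic scaling, it allows comparing representations of different dimensionality, which is convenient when the compared models use different building blocks.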