The success of deep learning comes from its ability to capture the hierarchical structure of data by learning high-level representations defined in terms of low-level ones. In this paper we explore self-supervised learning of hierarchical representations of speech by applying multiple levels of Contrastive Predictive Coding (CPC). We observe that simply stacking two CPC models does not yield significant improvements over single-level architectures. Inspired by the fact that speech is often described as a sequence of discrete units unevenly distributed in time, we propose a model in which the output of a low-level CPC module is non-uniformly downsampled to directly minimize the loss of a high-level CPC module. The latter is also designed to impose a prior of separability and discreteness on its representations: focused negative sampling encourages dissimilarity of successive high-level representations, and the prediction targets are quantized. Accounting for the structure of the speech signal improves upon single-level CPC features and enhances the disentanglement of the learned representations, as measured by downstream speech recognition tasks, while resulting in a meaningful segmentation of the signal that closely resembles phone boundaries.
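To make the architecture concrete, the following is a minimal PyTorch sketch of a two-level CPC with non-uniform downsampling between the levels. All module names, dimensions, the hard top-k frame selection, and the InfoNCE formulation here are illustrative assumptions, not the authors' implementation; in particular, quantization of the high-level prediction targets and the paper's focused negative sampling are omitted for brevity, and negatives are drawn from other time steps of the same sequence.

```python
# Minimal sketch: two stacked CPC modules with non-uniform downsampling.
# Hypothetical names/dimensions throughout; NOT the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

def info_nce(context, targets, predictor, k=1):
    """CPC loss: predict the representation k steps ahead; other time
    steps of the same sequence serve as negatives (a simplification of
    the paper's focused negative sampling)."""
    T = targets.size(1) - k
    preds = predictor(context[:, :T])                  # (B, T, D) predictions
    pos = targets[:, k:k + T]                          # (B, T, D) true futures
    logits = torch.einsum('btd,bsd->bts', preds, pos)  # scores vs. all steps
    labels = torch.arange(T, device=logits.device).expand(logits.size(0), T)
    return F.cross_entropy(logits.reshape(-1, T), labels.reshape(-1))

class TwoLevelCPC(nn.Module):
    def __init__(self, dim=256, ratio=0.25):
        super().__init__()
        self.ratio = ratio                      # fraction of frames kept
        self.low_enc = nn.Conv1d(1, dim, kernel_size=10, stride=5)
        self.low_ctx = nn.GRU(dim, dim, batch_first=True)
        self.boundary = nn.Linear(dim, 1)       # scores frames for selection
        self.high_ctx = nn.GRU(dim, dim, batch_first=True)
        self.low_pred = nn.Linear(dim, dim)
        self.high_pred = nn.Linear(dim, dim)

    def forward(self, wav):                     # wav: (B, 1, samples)
        z = self.low_enc(wav).transpose(1, 2)   # (B, T, D) low-level features
        c_low, _ = self.low_ctx(z)
        # Non-uniform downsampling: keep the top-scoring frames, in time
        # order. The paper learns the segmentation so as to minimize the
        # high-level loss; the hard, non-differentiable top-k used here is
        # a stand-in (boundary scores receive no gradient in this sketch).
        scores = self.boundary(c_low).squeeze(-1)        # (B, T)
        k = max(1, int(self.ratio * scores.size(1)))
        idx = scores.topk(k, dim=1).indices.sort(dim=1).values
        z_high = c_low.gather(
            1, idx.unsqueeze(-1).expand(-1, -1, c_low.size(-1)))
        c_high, _ = self.high_ctx(z_high)
        # Both levels are trained with a CPC (InfoNCE) objective.
        return (info_nce(c_low, z, self.low_pred)
                + info_nce(c_high, z_high, self.high_pred))

model = TwoLevelCPC()
loss = model(torch.randn(2, 1, 16000))          # one second at 16 kHz
loss.backward()
```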