Non-Contrastive Self-supervised Learning for Utterance-Level Information Extraction from Speech

In recent studies, self-supervised pre-trained models tend to outperform supervised pre-trained models in transfer learning. In particular, self-supervised learning (SSL) of utterance-level speech representations is useful for speech applications that require a discriminative representation of attributes that stay consistent within an utterance: speaker, language, emotion, and age. Existing frame-level self-supervised speech representations, e.g., wav2vec, can be pooled into utterance-level representations, but the models are usually large. There are also SSL techniques that learn utterance-level representations directly. Among the most successful are contrastive methods, which require negative sampling: selecting alternative samples to contrast with the current sample (the anchor). However, without labels, there is no guarantee that every negative sample belongs to a class different from the anchor's. This paper applies a non-contrastive self-supervised method to learn utterance-level embeddings. We adapted DIstillation with NO labels (DINO) from computer vision to speech. Unlike contrastive methods, DINO does not require negative sampling. We compared DINO to an x-vector model trained in a supervised manner. When transferred to downstream tasks (speaker verification, speech emotion recognition (SER), and Alzheimer's disease detection), DINO outperformed x-vector. We studied the influence of several aspects of transfer learning, such as dividing the fine-tuning process into steps, chunk lengths, and augmentation. During fine-tuning, tuning the last affine layers first and then the whole network surpassed fine-tuning all at once. Shorter chunk lengths, although they generate more diverse inputs, did not necessarily improve performance, implying that each application requires speech segments of at least a certain length. Augmentation was helpful in SER.
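To make the non-contrastive idea concrete, the following is a minimal NumPy sketch of a DINO-style objective as it is generally described for computer vision. It is not the authors' implementation. A student network is trained to match the output distribution of a teacher network on a different view of the same input, and no negative samples appear anywhere in the loss. The function names, the temperatures, the momentum value, and the random arrays standing in for network outputs are all illustrative assumptions. That the two views would be two chunks of one utterance is also an assumption, since the abstract does not spell it out.

```python
import numpy as np


def softmax(logits, temperature):
    """Temperature-scaled softmax over the last axis (numerically stable)."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def log_softmax(logits, temperature):
    """Temperature-scaled log-softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def dino_loss(student_out, teacher_out, center,
              student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between teacher and student distributions.

    The teacher output is centered and sharpened (low temperature) to
    discourage collapse. Only positive pairs are used: row i of both
    arrays is assumed to come from two views of the same utterance.
    """
    teacher_probs = softmax(teacher_out - center, teacher_temp)
    student_logp = log_softmax(student_out, student_temp)
    return float(-(teacher_probs * student_logp).sum(axis=-1).mean())


def ema_update(old, new, momentum):
    """Exponential moving average, used for the center (and, in a full
    training loop, for the teacher weights, which receive no gradients)."""
    return momentum * old + (1.0 - momentum) * new


# Random arrays stand in for the projection-head outputs of two views.
rng = np.random.default_rng(0)
batch, dim = 4, 8
student_out = rng.normal(size=(batch, dim))
teacher_out = rng.normal(size=(batch, dim))
center = np.zeros(dim)

loss = dino_loss(student_out, teacher_out, center)
center = ema_update(center, teacher_out.mean(axis=0), momentum=0.9)
print(f"loss = {loss:.4f}")
```

In a full training loop the student would be updated by gradient descent on this loss, while the teacher weights would follow the student through the same moving-average rule. The contrast with negative sampling is that nothing here requires deciding which other utterances belong to a different class.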
