Self-Supervised Learning (SSL) has made great strides recently. SSL speech models achieve strong performance on a wide range of downstream tasks, suggesting that they extract different aspects of information from speech. However, how SSL models store various kinds of information in their hidden representations without mutual interference is still poorly understood. Taking the recently successful SSL model HuBERT as an example, we explore how the model processes and stores speaker information in its representations. We find that HuBERT stores speaker information in representations whose positions correspond to silences in the waveform. Several pieces of evidence support this. (1) Utterances whose waveforms contain more silence yield better Speaker Identification (SID) accuracy. (2) When whole utterances are used for SID, the representations at silent positions consistently contribute more to the task. (3) When only part of an utterance's representations is used for SID, the silent part achieves higher accuracy than the other parts. Our findings not only contribute to a better understanding of SSL models but also improve performance: by simply adding silence to the original waveform, HuBERT's SID accuracy improves by nearly 2%.
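To make the silence-padding idea concrete, below is a minimal sketch (not the authors' code) of how one might prepend and append silence to a waveform before extracting HuBERT representations with the Hugging Face transformers library. The checkpoint name is a standard public HuBERT release, but the 0.5 s padding length is an illustrative assumption, not the paper's reported setting.

```python
import torch
from transformers import HubertModel, Wav2Vec2FeatureExtractor

# Sketch: pad a waveform with silence, then extract HuBERT representations.
# The 0.5 s padding length is an assumed value for illustration only.
model_name = "facebook/hubert-base-ls960"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
model = HubertModel.from_pretrained(model_name).eval()

sr = 16000                        # HuBERT expects 16 kHz audio
wave = torch.randn(sr * 3)        # stand-in for a real 3 s utterance
pad = torch.zeros(int(0.5 * sr))  # 0.5 s of silence (assumed length)
padded = torch.cat([pad, wave, pad])

inputs = extractor(padded.numpy(), sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, num_frames, 768)

# `hidden` would then feed a downstream SID probe; the frames aligned with
# the padded regions are the "silent positions" discussed in the abstract.
```

A per-frame SID probe trained on `hidden` could then compare accuracy on frames inside versus outside the silent regions, along the lines of evidence (2) and (3) above.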