This work presents an in-depth analysis of discrete self-supervised speech representations (units) through the lens of Generative Spoken Language Modeling (GSLM). Based on the findings of this analysis, we propose practical improvements to the discrete units used in GSLM. We first study these units along three axes: interpretation, visualization, and resynthesis. Our analysis finds a high correlation between speech units and both phonemes and phoneme families, while their correlation with speaker or gender is weaker. We also find redundancies in the extracted units and suggest that one reason may be the units' context. Following this analysis, we propose a new unsupervised metric to measure unit redundancies. Finally, we use this metric to develop new methods that improve the robustness of unit clustering, and we show significant improvements on zero-resource speech metrics such as ABX. Code and analysis tools are available at: https://github.com/slp-rl/SLM-Discrete-Representations