This work presents an in-depth analysis of discrete self-supervised speech representations through the lens of Generative Spoken Language Modeling (GSLM). Based on the findings of this analysis, we propose practical improvements to the discrete units used in GSLM. First, we study these units along three axes: interpretation, visualization, and resynthesis. Our analysis finds a high correlation between the speech units and phonemes and phoneme families, while their correlation with speaker or gender is weaker. Additionally, we find redundancies in the extracted units and suggest that one cause may be the units' context. Following this analysis, we propose a new, unsupervised metric to measure unit redundancies. Finally, we use this metric to develop new methods that improve the robustness of unit clustering and show significant improvements on zero-resource speech metrics such as ABX. Code and analysis tools are available under the following link.
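To make the unit-to-phoneme correlation and the notion of redundancy concrete, the following is a minimal sketch of two purity scores computed from frame-aligned unit and phoneme labels. It is an illustration only and not the metric proposed in this work: the proposed metric is unsupervised, whereas this sketch assumes phoneme labels are available. The function name `purity` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def purity(keys, values):
    """Fraction of frames whose value is the most common value for their key."""
    if len(keys) != len(values):
        raise ValueError("sequences must be frame-aligned and equally long")
    if not keys:
        raise ValueError("need at least one frame")
    # Count how often each value co-occurs with each key.
    by_key = defaultdict(Counter)
    for k, v in zip(keys, values):
        by_key[k][v] += 1
    # For every key, keep only the count of its majority value.
    majority = sum(c.most_common(1)[0][1] for c in by_key.values())
    return majority / len(keys)


# Toy frame-level labels (made up for illustration, not real data).
units = [3, 3, 7, 7, 7, 12, 12, 3]
phones = ["AH", "AH", "S", "S", "S", "S", "S", "T"]

# How consistently each unit maps to a single phoneme.
unit_purity = purity(units, phones)
# How consistently each phoneme maps to a single unit.
phone_purity = purity(phones, units)

print(f"unit purity:  {unit_purity:.3f}")
print(f"phone purity: {phone_purity:.3f}")
```

In this toy example the phoneme "S" is split across units 7 and 12, so unit purity stays high while phone purity drops. A gap of this kind is one simple, supervised way to observe that several units may encode the same sound.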