Self-supervised representations of speech are currently widely used for a large number of applications. Recently, some efforts have been made to analyze the type of information present in each of these representations. Most such work uses downstream models to test whether the representations can be successfully used for a specific task. The downstream models, though, typically perform nonlinear operations on the representation, extracting information that may not have been readily available in the original representation. In this work, we analyze the spatial organization of phone and speaker information in several state-of-the-art speech representations using methods that do not require a downstream model. We measure how different layers encode basic acoustic parameters such as formants and pitch using representation similarity analysis. Further, we study the extent to which each representation clusters the speech samples by phone or speaker classes using non-parametric statistical testing. Our results indicate that models represent these speech attributes differently depending on the target task used during pretraining.
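As an illustration only, and not the exact implementation used in this work, the sketch below shows one common way to carry out representation similarity analysis: build pairwise-dissimilarity matrices over the same set of frames from a model layer's embeddings and from acoustic descriptors (e.g., pitch and formants), then correlate their entries. The helper names (`rdm`, `rsa_score`) and the distance metrics are assumptions chosen for clarity.

```python
# Illustrative sketch of representation similarity analysis (RSA); not the
# paper's exact procedure. Correlates the pairwise-distance structure of a
# layer's frame embeddings with that of acoustic parameters.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr


def rdm(features: np.ndarray, metric: str = "cosine") -> np.ndarray:
    """Condensed representational dissimilarity matrix over frames (rows)."""
    return pdist(features, metric=metric)


def rsa_score(layer_feats: np.ndarray, acoustic_feats: np.ndarray) -> float:
    """Spearman correlation between the entries of the two RDMs."""
    assert layer_feats.shape[0] == acoustic_feats.shape[0], "same frames required"
    rho, _ = spearmanr(rdm(layer_feats), rdm(acoustic_feats, metric="euclidean"))
    return rho


if __name__ == "__main__":
    # Placeholder data: 200 frames, a 768-dim layer output vs. a 3-dim
    # acoustic descriptor (e.g., F0 and the first two formants).
    rng = np.random.default_rng(0)
    layer = rng.normal(size=(200, 768))
    acoustics = rng.normal(size=(200, 3))
    print(f"RSA (Spearman) = {rsa_score(layer, acoustics):.3f}")
```

A higher correlation under this kind of analysis would indicate that the layer's geometry mirrors the acoustic parameter's structure, without training any downstream model on the representation.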