Variation in speech is often quantified by comparing phonetic transcriptions of the same utterance. However, manually transcribing speech is time-consuming and error-prone. As an alternative, we therefore investigate extracting acoustic embeddings from several self-supervised neural models. We use these representations to compute word-based pronunciation differences between non-native and native speakers of English, and between Norwegian dialect speakers. For comparison with several earlier studies, we evaluate how well these differences match human perception by comparing them with available human judgements of similarity. We show that speech representations extracted from a specific type of neural model (i.e. Transformers) match human perception more closely than two earlier approaches based on phonetic transcriptions and MFCC-based acoustic features. We furthermore find that features from the neural models are generally best extracted from one of the middle hidden layers rather than from the final layer. We also demonstrate that neural speech representations not only capture segmental differences, but also intonational and durational differences that cannot adequately be represented by the set of discrete symbols used in phonetic transcriptions.
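To make the described approach concrete, the following is a minimal sketch of how word-level pronunciation differences could be computed from hidden-layer representations of a self-supervised Transformer model. The checkpoint name, the chosen layer index, and the use of length-normalised dynamic time warping are illustrative assumptions, not necessarily the exact pipeline evaluated in this work.

```python
# Sketch: compare two word recordings via hidden-layer embeddings of a
# self-supervised Transformer model (assumed checkpoint and layer choice).
import torch
import librosa
import numpy as np
from scipy.spatial.distance import cdist
from transformers import Wav2Vec2Model, Wav2Vec2FeatureExtractor

MODEL = "facebook/wav2vec2-large-960h"  # illustrative checkpoint, not necessarily the one used here
extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL)
model = Wav2Vec2Model.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

def embed(wav_path, layer=10):
    """Frame-level embeddings from one hidden layer (middle layers often work best)."""
    audio, _ = librosa.load(wav_path, sr=16000)
    inputs = extractor(audio, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # hidden_states[0] is the CNN feature encoder output; index i selects the i-th Transformer block
    return out.hidden_states[layer].squeeze(0).numpy()

def dtw_distance(a, b):
    """Length-normalised dynamic time warping distance between two embedding sequences."""
    cost = cdist(a, b, metric="euclidean")
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[n, m] / (n + m)

# Hypothetical usage: pronunciation difference between two realisations of the same word
# dist = dtw_distance(embed("speaker_a_word.wav"), embed("speaker_b_word.wav"))
```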