Variation in speech is often represented and investigated using phonetic transcriptions, but transcribing speech is time-consuming and error-prone. As an alternative representation, therefore, we investigate the extraction of acoustic embeddings from several self-supervised neural models. We use these representations to compute word-based pronunciation differences between non-native and native speakers of English, and between different dialect pronunciations, and evaluate these differences by comparing them with available human native-likeness judgments. We show that Transformer-based speech representations lead to significant performance gains over the use of phonetic transcriptions, and find that feature-based use of Transformer models is most effective with one of the middle layers rather than the final layer. We also demonstrate that these neural speech representations capture not only segmental differences, but also intonational and durational differences that cannot be represented by the set of discrete symbols used in phonetic transcriptions.