Deep acoustic models represent linguistic information based on massive amounts of data. Unfortunately, for regional languages and dialects such resources are mostly not available. However, deep acoustic models might have learned linguistic information that transfers to low-resource languages. In this study, we evaluate whether this is the case through the task of distinguishing low-resource (Dutch) regional varieties. By extracting embeddings from the hidden layers of various wav2vec 2.0 models (including new models pre-trained and/or fine-tuned on Dutch) and using dynamic time warping, we compute pairwise pronunciation differences averaged over 10 words for over 100 individual dialects from four (regional) languages. We then cluster the resulting difference matrix into four groups and compare these to a gold standard and to a partitioning based on comparing phonetic transcriptions. Our results show that acoustic models outperform the (traditional) transcription-based approach without requiring phonetic transcriptions, with the best performance achieved by the multilingual XLSR-53 model fine-tuned on Dutch. On the basis of only six seconds of speech, the resulting clustering closely matches the gold standard.
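As a minimal sketch of the pipeline described above (random arrays stand in for actual wav2vec 2.0 hidden-layer embeddings, and all function and dialect names here are illustrative, not the authors' implementation), the DTW-based word distances and the clustering into four groups could look like:

```python
import numpy as np
from scipy.spatial.distance import cdist, squareform
from scipy.cluster.hierarchy import linkage, fcluster

def dtw_distance(a, b):
    """Length-normalized dynamic time warping cost between two
    frame-level embedding sequences a (n, d) and b (m, d)."""
    cost = cdist(a, b)                      # Euclidean frame-to-frame costs
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)   # accumulated-cost matrix
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[n, m] / (n + m)              # normalize by alignment length bound

def dialect_difference_matrix(embeddings):
    """embeddings[dialect] is a list of per-word (frames, dim) arrays.
    Returns a symmetric matrix of mean per-word DTW distances."""
    dialects = list(embeddings)
    k = len(dialects)
    diff = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            d = np.mean([dtw_distance(wa, wb)
                         for wa, wb in zip(embeddings[dialects[i]],
                                           embeddings[dialects[j]])])
            diff[i, j] = diff[j, i] = d
    return diff

# Toy demo: 4 hypothetical dialects, 10 "words" each, random embeddings.
rng = np.random.default_rng(0)
toy = {d: [rng.normal(size=(int(rng.integers(20, 40)), 16)) for _ in range(10)]
       for d in ["dialect_a", "dialect_b", "dialect_c", "dialect_d"]}
diff = dialect_difference_matrix(toy)
# Agglomerative clustering of the condensed difference matrix into 4 groups.
labels = fcluster(linkage(squareform(diff), method="average"),
                  t=4, criterion="maxclust")
```

The linkage method and frame-distance metric are assumptions for the sketch; the key property is that DTW aligns embedding sequences of unequal length before the per-word costs are averaged into a single dialect-by-dialect difference matrix.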