Pre-trained speech representations like wav2vec 2.0 are a powerful tool for automatic speech recognition (ASR). Yet many endangered languages lack sufficient data for pre-training such models, or are predominantly oral vernaculars without a standardised writing system, precluding fine-tuning. Query-by-example spoken term detection (QbE-STD) offers an alternative for iteratively indexing untranscribed speech corpora by locating spoken query terms. Using data from 7 Australian Aboriginal languages and a regional variety of Dutch, all of which are endangered or vulnerable, we show that QbE-STD can be improved by leveraging representations developed for ASR (wav2vec 2.0: the English monolingual model and the XLSR53 multilingual model). Surprisingly, the English model outperformed the multilingual model on 4 Australian language datasets, raising questions about how to optimally leverage self-supervised speech representations for QbE-STD. Nevertheless, we find that wav2vec 2.0 representations (either English or XLSR53) offer large improvements (56-86% relative) over state-of-the-art approaches on our endangered language datasets.
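The QbE-STD task described above can be sketched with a standard baseline: extract frame-level features for a spoken query and a search utterance, then use subsequence dynamic time warping (DTW) over a cosine distance matrix to score how well the query matches anywhere inside the utterance. This is a minimal illustration assuming features (e.g. from wav2vec 2.0) have already been extracted as `(frames, dims)` arrays; it is not necessarily the exact detection method used in the paper.

```python
import numpy as np

def cosine_dist(Q, U):
    """Pairwise cosine distance between query frames Q and utterance frames U."""
    Qn = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    Un = U / np.linalg.norm(U, axis=1, keepdims=True)
    return 1.0 - Qn @ Un.T  # shape: (len_query, len_utterance)

def qbe_std_score(query_feats, utt_feats):
    """Subsequence DTW: best alignment cost of the query against any
    region of the utterance, normalized by query length.
    Lower score = better match (0 for an exact occurrence)."""
    D = cosine_dist(query_feats, utt_feats)
    nq, nu = D.shape
    acc = np.full((nq, nu), np.inf)
    acc[0, :] = D[0, :]  # the query may start at any utterance frame
    for i in range(1, nq):
        for j in range(nu):
            best_prev = acc[i - 1, j]
            if j > 0:
                best_prev = min(best_prev, acc[i - 1, j - 1], acc[i, j - 1])
            acc[i, j] = D[i, j] + best_prev
    # the query may end at any utterance frame
    return acc[-1, :].min() / nq
```

Ranking all utterances in a corpus by this score yields a candidate list of likely query occurrences, which is the iterative-indexing workflow the abstract refers to.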