For languages with insufficient resources to train speech recognition systems, query-by-example spoken term detection (QbE-STD) offers a way of accessing an untranscribed speech corpus by helping identify regions where spoken query terms occur. Yet retrieval performance can be poor when the query and corpus are spoken by different speakers and produced in different recording conditions. Using data selected from a variety of speakers and recording conditions from 7 Australian Aboriginal languages and a regional variety of Dutch, all of which are endangered or vulnerable, we evaluated whether QbE-STD performance on these languages could be improved by leveraging representations extracted from the pre-trained English wav2vec 2.0 model. Compared to the use of Mel-frequency cepstral coefficients and bottleneck features, we find that representations from the middle layers of the wav2vec 2.0 Transformer offer large gains in task performance (between 56% and 86%). While features extracted using the pre-trained English model yielded improved detection on all the evaluation languages, better detection performance was associated with the evaluation language's phonological similarity to English.
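The abstract does not spell out the matching backend, but QbE-STD systems are commonly built on dynamic time warping (DTW) over a frame-level distance matrix between query and corpus features. The sketch below is a minimal illustration of that general approach, not the paper's exact pipeline: it implements subsequence DTW with cosine distances over arbitrary frame-feature matrices, which in the paper's setting would be MFCCs, bottleneck features, or hidden states from a middle wav2vec 2.0 Transformer layer. All function names here are illustrative.

```python
import numpy as np

def cosine_dist_matrix(q, d):
    # Pairwise cosine distances between query frames (rows of q) and
    # corpus frames (rows of d). The features could be MFCCs, bottleneck
    # features, or wav2vec 2.0 hidden states, as compared in the paper.
    qn = q / np.linalg.norm(q, axis=1, keepdims=True)
    dn = d / np.linalg.norm(d, axis=1, keepdims=True)
    return 1.0 - qn @ dn.T

def dtw_detection_score(query_feats, corpus_feats):
    """Subsequence DTW: align the full query against any contiguous
    region of the corpus; return a score in [0, 1], higher = better."""
    dist = cosine_dist_matrix(query_feats, corpus_feats)
    n, m = dist.shape
    acc = np.full((n, m), np.inf)
    acc[0, :] = dist[0, :]  # the match may begin at any corpus frame
    for i in range(1, n):
        for j in range(m):
            best_prev = acc[i - 1, j]
            if j > 0:
                best_prev = min(best_prev, acc[i, j - 1], acc[i - 1, j - 1])
            acc[i, j] = dist[i, j] + best_prev
    # Best alignment cost, normalised by query length, mapped to a score.
    return 1.0 - acc[-1, :].min() / n

# Synthetic check: plant the query inside one "corpus" but not the other.
rng = np.random.default_rng(0)
query = rng.normal(size=(10, 20))                      # 10 frames, 20 dims
corpus_hit = np.concatenate([rng.normal(size=(15, 20)),
                             query,
                             rng.normal(size=(15, 20))])
corpus_miss = rng.normal(size=(40, 20))
score_hit = dtw_detection_score(query, corpus_hit)
score_miss = dtw_detection_score(query, corpus_miss)
```

In a real system these scores would be computed for every (query, utterance) pair and thresholded or ranked; the paper's finding is that swapping the input features to middle-layer wav2vec 2.0 representations is what drives the 56-86% gains, independent of this matching step.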