Most recent speech recognition models rely on large supervised datasets, which are unavailable for many low-resource languages. In this work, we present a speech recognition pipeline that does not require any audio in the target language. The only assumption is that we have access to a raw text dataset or a set of n-gram statistics for that language. Our pipeline consists of three components: an acoustic model, a pronunciation model, and a language model. Unlike in the standard pipeline, our acoustic and pronunciation models are multilingual models used without any supervision. The language model is built from the n-gram statistics or the raw text dataset. We build speech recognition systems for 1909 languages by combining the pipeline with Crubadan, a large n-gram database of endangered languages. Furthermore, we test our approach on 129 languages across two datasets: Common Voice and the CMU Wilderness dataset. On the Wilderness dataset, we achieve 50% character error rate (CER) and 74% word error rate (WER) using Crubadan statistics only, and improve these to 45% CER and 69% WER when using 10,000 raw text utterances.
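For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how CER and WER are commonly computed: the total edit distance between reference and hypothesis, divided by the total reference length, at the character or word level. The function names are hypothetical, and the choices to count spaces as characters and to pool errors over the whole corpus are assumptions for illustration, not necessarily the scoring setup used in this work.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def error_rate(references, hypotheses, tokenize):
    """Corpus-level error rate: total edits divided by total reference tokens."""
    edits = 0
    total = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tokens = tokenize(ref)
        edits += edit_distance(ref_tokens, tokenize(hyp))
        total += len(ref_tokens)
    return edits / total


def cer(references, hypotheses):
    # Assumption: every character, including spaces, counts as one token.
    return error_rate(references, hypotheses, list)


def wer(references, hypotheses):
    # Assumption: words are whitespace-separated.
    return error_rate(references, hypotheses, str.split)


if __name__ == "__main__":
    refs = ["the cat sat"]
    hyps = ["the cat sit"]
    print(f"CER = {cer(refs, hyps):.3f}, WER = {wer(refs, hyps):.3f}")
```

Because a single wrong character makes the whole word wrong, WER is typically well above CER on the same output, which matches the gap between the two figures reported above.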