Several domains have corresponding widely used feature extractors, such as ResNet, BERT, and GPT-x. These models are usually pre-trained on large amounts of unlabeled data by self-supervision and can be effectively applied to downstream tasks. In the speech domain, wav2vec2.0 has begun to show its powerful representation ability and the feasibility of ultra-low-resource speech recognition on the Librispeech corpus, which belongs to the audiobook domain. However, wav2vec2.0 has not been examined on real spoken scenarios or on languages other than English. To verify its universality across languages, we apply the pre-trained models to low-resource speech recognition tasks in various spoken languages. We achieve more than 20% relative improvement over previous work in six languages; among them, English achieves a gain of 52.4%. Moreover, coarse-grained modeling units, such as subwords or characters, achieve better results than fine-grained modeling units, such as phones or letters.
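To make the fine-tuning setup concrete, the sketch below shows how a self-supervised wav2vec2.0 checkpoint can be adapted to downstream ASR with a CTC head over coarse-grained (character-level) units. It uses the HuggingFace `transformers` API for illustration only; the checkpoint name `facebook/wav2vec2-base`, the `vocab.json` vocabulary file, and the dummy batch are placeholder assumptions, not the exact configuration used in this work.

```python
# Minimal sketch (assumptions marked below): fine-tune a pre-trained wav2vec2.0
# encoder with a randomly initialized CTC output layer over character units.
import torch
from transformers import (
    Wav2Vec2CTCTokenizer,
    Wav2Vec2FeatureExtractor,
    Wav2Vec2Processor,
    Wav2Vec2ForCTC,
)

# Assumption: "vocab.json" maps each modeling unit (here, characters) to an id.
tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|"
)
feature_extractor = Wav2Vec2FeatureExtractor(
    feature_size=1, sampling_rate=16000, padding_value=0.0, do_normalize=True
)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

# Assumption: a generic self-supervised checkpoint; the CTC head is sized to the
# chosen modeling units and trained from scratch during fine-tuning.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-base",
    vocab_size=len(tokenizer),
    ctc_loss_reduction="mean",
    pad_token_id=tokenizer.pad_token_id,
)
model.freeze_feature_encoder()  # common practice: keep the CNN feature encoder fixed

# One training step on a dummy 16 kHz waveform and a placeholder transcript.
waveform = torch.randn(16000).numpy()  # 1 second of random audio (placeholder)
inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")
labels = tokenizer("hello world", return_tensors="pt").input_ids

loss = model(input_values=inputs.input_values, labels=labels).loss
loss.backward()  # in practice, wrap this in an optimizer loop over real data
```

Swapping `vocab.json` for a subword, phone, or letter inventory is what changes the granularity of the modeling units compared in the abstract; the rest of the fine-tuning recipe stays the same.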