Deep learning techniques have proven effective in a variety of tasks, particularly in speech recognition systems, which aim to transcribe spoken audio into a sequence of written words. Despite progress in the area, speech recognition remains difficult, especially for languages with little available data, such as Brazilian Portuguese (BP). This work presents the development of a public Automatic Speech Recognition (ASR) system that uses only openly available audio data, obtained by fine-tuning the Wav2vec 2.0 XLSR-53 model, which is pre-trained on many languages, on BP data. The final model achieves an average word error rate of 12.4% over 7 different datasets (10.5% when a language model is applied). To the best of our knowledge, this is the best result for BP among open ASR systems.
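For readers unfamiliar with the metric, word error rate (WER) is conventionally the word-level edit distance between the reference transcript and the system output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the evaluation code used in this work. The function names `word_error_rate` and `average_wer` are hypothetical, and the unweighted mean over datasets is an assumption, since the abstract does not state how the 7 per-dataset scores were averaged.

```python
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def average_wer(per_dataset_wer: Sequence[float]) -> float:
    """Unweighted mean of per-dataset WER values (assumed averaging scheme)."""
    if not per_dataset_wer:
        raise ValueError("need at least one dataset score")
    return sum(per_dataset_wer) / len(per_dataset_wer)
```

Under this definition, a hypothesis that drops one word from a five-word reference scores a WER of 0.2, and the value can exceed 1.0 when the output contains many inserted words.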