Deep learning techniques have been shown to be effective in various tasks, especially in the development of speech recognition systems, that is, systems that aim to transcribe spoken audio into a sequence of words. Despite the progress in the area, speech recognition can still be considered difficult, especially for languages lacking available data, such as Brazilian Portuguese. In this sense, this work presents the development of a public Automatic Speech Recognition system using only openly available audio data, obtained by fine-tuning the Wav2vec 2.0 XLSR-53 model, which is pre-trained on many languages, on Brazilian Portuguese data. The final model achieves a Word Error Rate of 11.95% on the Common Voice dataset. To the best of our knowledge, this is 13% lower than the Word Error Rate of the best open Automatic Speech Recognition model available for Brazilian Portuguese, which is a promising result for the language. In general, this work validates the use of self-supervised learning techniques, in particular the Wav2vec 2.0 architecture, in the development of robust systems, even for languages with little available data.
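For readers unfamiliar with the metric reported above, Word Error Rate is conventionally defined as the word-level edit distance (substitutions, deletions, and insertions) between the reference transcript and the system output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the evaluation code used in this work; the function name `word_error_rate`, the whitespace tokenization, and the example sentences are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly;
    real evaluations usually normalize case and punctuation first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical example: one of four reference words is dropped.
    print(word_error_rate("o gato preto dorme", "o gato dorme"))
```

When scoring a whole test set, the usual convention is to sum the edit counts over all utterances and divide by the total number of reference words, rather than averaging per-utterance rates.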