Automatic speech recognition systems are part of daily life, embedded in personal assistants and mobile phones, where they facilitate human-machine interaction and give access to information in an intuitive way. Such systems are usually implemented with machine learning techniques, especially deep neural networks. Although they perform well at transcribing speech to text, few works address recognition in noisy environments. The datasets commonly used do not contain noisy audio examples, and the issue is usually only mitigated through data augmentation. This work presents the process of building a dataset of noisy audio for the specific case of audio degraded by interference, as commonly found in radio transmissions. Additionally, we present initial results from a recognizer evaluated on this data, which indicate the benefits of using the dataset in the recognizer's training process. The recognizer achieves an average character error rate of 0.4116 on the noisy set (SNR = 30).
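To make the two quantities in the abstract concrete, the following is a minimal sketch of how a character error rate and a target signal-to-noise ratio are commonly computed. It is not the authors' pipeline. The function names (`edit_distance`, `character_error_rate`, `mix_at_snr`) are hypothetical, the character error rate is taken to be the Levenshtein distance divided by the reference length, and the SNR is assumed to be expressed in decibels, which the abstract does not state.

```python
import math


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference characters to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalised by the reference length (assumed CER definition)."""
    if not ref:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def mix_at_snr(speech: list[float], noise: list[float], snr_db: float) -> list[float]:
    """Add noise to speech, scaled so the mix has the requested SNR.

    Assumes snr_db is in decibels and both signals have the same length.
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    p_speech = sum(s * s for s in speech) / len(speech)  # mean signal power
    p_noise = sum(n * n for n in noise) / len(noise)     # mean noise power
    if p_noise == 0:
        raise ValueError("noise signal has zero power")
    # Solve p_speech / (scale^2 * p_noise) = 10^(snr_db / 10) for scale.
    scale = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]
```

For instance, `character_error_rate("radio", "ratio")` gives 0.2, since one substitution is needed over five reference characters. Under this definition, a value of 0.4116 corresponds to roughly four character edits for every ten reference characters.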