Automatic Speech Recognition and Text-to-Speech systems are primarily trained in a supervised fashion and require high-quality, accurately labeled speech datasets. In this work, we examine common problems with speech data and introduce a toolbox for the construction and interactive error analysis of speech datasets. The construction tool is based on the work of Kürzinger et al., and, to the best of our knowledge, the dataset exploration tool is the first open-source tool of its kind. We demonstrate how to apply these tools to create a Russian speech dataset and to analyze existing speech datasets (Multilingual LibriSpeech, Mozilla Common Voice). The tools are open-sourced as part of the NeMo framework.