We present an open-source speech corpus for the Kazakh language. The Kazakh speech corpus (KSC) contains around 332 hours of transcribed audio comprising over 153,000 utterances spoken by participants of both genders from different regions and age groups. It was carefully inspected by native Kazakh speakers to ensure high quality. The KSC is the largest publicly available database developed to advance various Kazakh speech and language processing applications. In this paper, we first describe the data collection and preprocessing procedures, followed by the database specifications. We also share our experience and the challenges faced during database construction, which might benefit other researchers planning to build a speech corpus for a low-resource language. To demonstrate the reliability of the database, we performed preliminary speech recognition experiments. The experimental results suggest that the quality of the audio and transcripts is promising (2.8% character error rate and 8.7% word error rate on the test set). To enable experiment reproducibility and ease corpus usage, we also released an ESPnet recipe for our speech recognition models.