This paper introduces TALCS, a new Mandarin-English code-switching speech corpus suitable for training and evaluating code-switching speech recognition systems. The TALCS corpus is derived from real online one-to-one English teaching sessions at TAL Education Group and contains roughly 587 hours of speech sampled at 16 kHz. To the best of our knowledge, the TALCS corpus is the largest well-labeled, open-source Mandarin-English code-switching automatic speech recognition (ASR) dataset in the world. We describe the recording procedure in detail, including the audio capturing devices and corpus environments. The TALCS corpus is freely available for download under a permissive license. Using the TALCS corpus, we conduct ASR experiments with two popular speech recognition toolkits, ESPnet and Wenet, to build a baseline system, and we compare the Mixture Error Rate (MER) obtained with the two toolkits on the corpus. The experimental results imply that the quality of the audio recordings and transcriptions is promising and that the baseline system is workable.
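To make the MER metric concrete, the following is a minimal sketch of how a mixture error rate is commonly computed for Mandarin-English speech: each Mandarin character and each English word counts as one token, and the token-level edit distance is divided by the number of reference tokens. The tokenization rule, the lowercasing, and the function names (`tokenize_mixed`, `edit_distance`, `mixture_error_rate`) are assumptions made for illustration; this is not the scoring script used with ESPnet or Wenet in the experiments.

```python
import re

# Assumption: one token per CJK character (basic Unified Ideographs block),
# and one token per run of Latin letters, digits, or apostrophes.
_TOKEN = re.compile(r"[\u4e00-\u9fff]|[A-Za-z0-9']+")


def tokenize_mixed(text):
    """Split mixed text into Mandarin characters and lowercased English words."""
    return [tok.lower() for tok in _TOKEN.findall(text)]


def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (sub/ins/del cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference token
                cur[j - 1] + 1,          # insertion of a hypothesis token
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def mixture_error_rate(references, hypotheses):
    """Total token errors divided by total reference tokens over all pairs."""
    errors = 0
    total = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tokens = tokenize_mixed(ref)
        errors += edit_distance(ref_tokens, tokenize_mixed(hyp))
        total += len(ref_tokens)
    if total == 0:
        raise ValueError("references contain no tokens")
    return errors / total
```

For example, with the reference "我想要 apple" (four tokens) and the hypothesis "我想 apple", one character is deleted, so this sketch yields an MER of 1/4.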