This paper introduces GigaSpeech, an evolving, multi-domain English speech recognition corpus with 10,000 hours of high-quality labeled audio suitable for supervised training, and 40,000 hours of total audio suitable for semi-supervised and unsupervised training. Around 40,000 hours of transcribed audio are first collected from audiobooks, podcasts, and YouTube, covering both read and spontaneous speaking styles and a variety of topics such as arts, science, and sports. A new forced alignment and segmentation pipeline is proposed to create sentence segments suitable for speech recognition training and to filter out segments with low-quality transcriptions. For system training, GigaSpeech provides five subsets of different sizes: 10h, 250h, 1,000h, 2,500h, and 10,000h. For the 10,000-hour XL training subset, we cap the word error rate at 4% during the filtering/validation stage, and for all other smaller training subsets, we cap it at 0%. The DEV and TEST evaluation sets, on the other hand, are re-processed by professional human transcribers to ensure high transcription quality. Baseline systems are provided for popular speech recognition toolkits, namely Athena, ESPnet, Kaldi, and Pika.
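To make the WER-cap criterion concrete, the following is a minimal sketch, not the paper's actual validation pipeline, of how a segment could be kept or dropped against such a cap. The `wer` and `keep_segment` helpers and the example strings are illustrative assumptions; only the 4%/0% thresholds come from the abstract above.

```python
def wer(ref_words, hyp_words):
    # Word error rate = word-level Levenshtein distance / reference length.
    d = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp_words, 1):
            cur = d[j]
            d[j] = min(d[j] + 1,         # deletion of the reference word
                       d[j - 1] + 1,     # insertion of a hypothesis word
                       prev + (r != h))  # substitution (or exact match)
            prev = cur
    return d[-1] / max(len(ref_words), 1)


# Caps mirroring the abstract: 4% for the XL subset, 0% for the smaller ones.
XL_CAP, SMALL_CAP = 0.04, 0.0


def keep_segment(reference, hypothesis, cap=XL_CAP):
    """Keep a segment only if its WER is within the given cap (hypothetical helper)."""
    return wer(reference.split(), hypothesis.split()) <= cap


# Toy example: one substitution in 26 words (~3.8% WER) passes the XL cap
# but fails the 0% cap used for the smaller training subsets.
ref = "the quick brown fox jumps over the lazy dog and runs away fast " * 2
hyp = ref.replace("lazy", "hazy", 1)
print(keep_segment(ref, hyp, XL_CAP), keep_segment(ref, hyp, SMALL_CAP))  # True False
```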