This paper presents a new large-scale Japanese speech corpus for training automatic speech recognition (ASR) systems. The corpus contains over 2,000 hours of speech with transcripts, built from Japanese TV recordings and their subtitles. We develop an iterative workflow that extracts matching audio and subtitle segments from the TV recordings, based on a conventional method for lightly supervised audio-to-text alignment. We evaluate a model trained on our corpus using an evaluation dataset built from Japanese TEDx presentation videos and confirm that its performance is better than that of a model trained on the Corpus of Spontaneous Japanese (CSJ). The experimental results demonstrate the usefulness of our corpus for training ASR systems. The corpus is made publicly available to the research community, along with the Kaldi scripts used to train the models reported in this paper.