In this paper, we construct a new Japanese speech corpus called "JTubeSpeech." Although recent end-to-end learning requires large-scale speech corpora, such open-source corpora for languages other than English have not yet been established. We describe the construction of a corpus from YouTube videos and subtitles for speech recognition and speaker verification. Our method automatically filters the videos and subtitles with almost no language-dependent processes. We consistently employ Connectionist Temporal Classification (CTC)-based techniques for automatic speech recognition (ASR) and a speaker-variation-based method for automatic speaker verification (ASV). We build 1) a large-scale Japanese ASR benchmark with more than 1,300 hours of data and 2) 900 hours of data for Japanese ASV.