This paper introduces SpeeChain, an open-source PyTorch-based toolkit designed to develop the machine speech chain for large-scale use. This first release focuses on the TTS-to-ASR chain, a core component of the machine speech chain, in which ASR training data is augmented with speech that a TTS model synthesizes from unspoken text. To build an efficient pipeline for the large-scale TTS-to-ASR chain, we implement easy-to-use multi-GPU batch-level model inference, multi-dataloader batch generation, and on-the-fly data selection techniques. In this paper, we first explain the overall procedure of the TTS-to-ASR chain and the difficulties of each step. Then, we present a detailed ablation study on different types of unlabeled data, data filtering thresholds, batch composition, and real-synthetic data ratios. Our experimental results on train_clean_460 of LibriSpeech demonstrate that our TTS-to-ASR chain can significantly reduce the word error rate (WER) in a semi-supervised setting.
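To make the ideas of data filtering thresholds, batch composition, and real-synthetic data ratios concrete, the following is a minimal sketch in plain Python. It is not SpeeChain's API: the function names `filter_synthetic` and `mixed_batches`, the per-utterance `score` field, and the parameters `synth_ratio` and `threshold` are all hypothetical, and the actual toolkit may select and mix data differently.

```python
import random
from itertools import cycle, islice


def filter_synthetic(utterances, threshold):
    """Keep synthetic utterances whose (hypothetical) quality score meets the threshold."""
    return [u for u in utterances if u["score"] >= threshold]


def mixed_batches(real, synthetic, batch_size, synth_ratio, threshold, seed=0):
    """Yield batches that mix real and filtered synthetic utterances.

    One pass covers the real data once; the filtered synthetic data is
    cycled as often as needed to fill its share of each batch.
    """
    rng = random.Random(seed)
    n_synth = round(batch_size * synth_ratio)  # synthetic slots per batch
    n_real = batch_size - n_synth              # real slots per batch
    if n_real <= 0:
        raise ValueError("synth_ratio leaves no room for real data")

    kept = filter_synthetic(synthetic, threshold)
    if n_synth > 0 and not kept:
        raise ValueError("no synthetic utterance passes the threshold")

    # Shuffle copies so the caller's lists are left untouched.
    real = list(real)
    rng.shuffle(real)
    rng.shuffle(kept)

    synth_iter = cycle(kept)
    for start in range(0, len(real), n_real):
        real_part = real[start:start + n_real]
        synth_part = list(islice(synth_iter, n_synth))
        yield real_part + synth_part


if __name__ == "__main__":
    real_data = [{"id": f"r{i}", "source": "real"} for i in range(6)]
    tts_data = [
        {"id": f"s{i}", "source": "tts", "score": s}
        for i, s in enumerate([0.9, 0.2, 0.8, 0.5])
    ]
    for batch in mixed_batches(real_data, tts_data, batch_size=4,
                               synth_ratio=0.5, threshold=0.6):
        print([u["id"] for u in batch])
```

In this sketch the ratio is enforced per batch rather than per epoch, which is one possible reading of "batch composition". A per-epoch ratio with separately drawn real and synthetic batches would be an equally plausible design.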