Automatic speech recognition (ASR) needs to be robust to speaker differences. Voice conversion (VC) modifies the speaker characteristics of input speech, which makes it an attractive tool for ASR data augmentation. In this paper, we demonstrate that voice conversion can be used as a data augmentation technique to improve ASR performance, even on LibriSpeech, which contains 2,456 speakers. For ASR augmentation, the VC model must be robust to a wide range of input speech. This motivates the use of a non-autoregressive, non-parallel VC model, and the use of a pretrained ASR encoder within the VC model. This work suggests that even in a corpus with many speakers, speaker diversity may remain a limitation to ASR quality. Finally, our analysis of VC performance has provided useful metrics for the objective evaluation of VC quality.