In this paper, we investigate several existing generative adversarial network (GAN)-based voice conversion methods, together with a new state-of-the-art one, for enhancing dysarthric speech with the aim of improving dysarthric speech recognition. We compare key components of the existing methods in a rigorous ablation study to find the most effective way to improve dysarthric speech recognition. We find that straightforward signal processing methods, such as stationary noise removal and vocoder-based time stretching, lead to dysarthric speech recognition results comparable to those obtained with state-of-the-art GAN-based voice conversion methods, as measured on a phoneme recognition task. Additionally, our proposed solution, which combines MaskCycleGAN-VC with time-stretched enhancement, improves the phoneme recognition results for certain dysarthric speakers compared to our time-stretched baseline.