Tools that generate high-quality synthetic speech perceptually indistinguishable from speech recorded from human speakers are easily available. Several approaches have been proposed for detecting synthetic speech. Many of these approaches use deep learning methods as a black box, providing no reasoning for the decisions they make, which limits their interpretability. In this paper, we propose the Disentangled Spectrogram Variational Auto Encoder (DSVAE), a two-stage trained variational autoencoder that processes spectrograms of speech using disentangled representation learning to generate interpretable representations of a speech signal for detecting synthetic speech. DSVAE also creates an activation map that highlights the spectrogram regions discriminating synthetic from bona fide human speech. We evaluated the representations obtained from DSVAE using the ASVspoof2019 dataset. Our experimental results show high accuracy (>98%) in detecting synthetic speech from 6 known and 10 out of 11 unknown speech synthesizers. We also visualize the representations obtained from DSVAE for 17 different speech synthesizers and verify that they are indeed interpretable and discriminate between bona fide speech and synthetic speech from each of the synthesizers.
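To make the overall pipeline concrete, the following is a minimal sketch of a spectrogram variational autoencoder with a latent-space classifier, in the spirit of the approach described above. This is not the authors' implementation: the framework (PyTorch), input shape (log-spectrograms of size 128x128), layer sizes, the single combined loss, and all names (SpectrogramVAE, vae_loss, beta) are illustrative assumptions.

```python
# Hypothetical sketch of a spectrogram VAE with a bona fide / synthetic
# classifier on the latent code. Assumes inputs of shape (batch, 1, 128, 128).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpectrogramVAE(nn.Module):
    def __init__(self, latent_dim=64):
        super().__init__()
        # Encoder: convolutions downsample the spectrogram to a feature map.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 4, stride=2, padding=1),    # 128 -> 64
            nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1),   # 64 -> 32
            nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1),  # 32 -> 16
            nn.ReLU(),
            nn.Flatten(),
        )
        feat_dim = 128 * 16 * 16
        self.fc_mu = nn.Linear(feat_dim, latent_dim)
        self.fc_logvar = nn.Linear(feat_dim, latent_dim)
        # Decoder mirrors the encoder to reconstruct the spectrogram.
        self.fc_dec = nn.Linear(latent_dim, feat_dim)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),  # 16 -> 32
            nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1),   # 32 -> 64
            nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),    # 64 -> 128
        )
        # Binary head on the latent code: bona fide vs. synthetic.
        self.classifier = nn.Linear(latent_dim, 2)

    def reparameterize(self, mu, logvar):
        # Sample z = mu + sigma * eps so gradients flow through mu and logvar.
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = self.reparameterize(mu, logvar)
        recon = self.decoder(self.fc_dec(z).view(-1, 128, 16, 16))
        logits = self.classifier(z)
        return recon, mu, logvar, logits

def vae_loss(recon, x, mu, logvar, logits, labels, beta=1.0):
    # Reconstruction + KL divergence, plus a classification term that
    # shapes the latent space to separate bona fide and synthetic speech.
    rec = F.mse_loss(recon, x, reduction="mean")
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    clf = F.cross_entropy(logits, labels)
    return rec + beta * kld + clf
```

In a two-stage arrangement such as the one the abstract describes, one would typically optimize the reconstruction and KL terms first and introduce the discriminative objective in a second stage; the sketch combines all terms into a single loss only for brevity. Activation maps highlighting discriminative spectrogram regions could then be derived from the trained network with standard attribution techniques.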