Thanks to recent advances in deep learning, sophisticated generation tools now exist that produce extremely realistic synthetic speech. However, malicious uses of such tools are possible and likely, posing a serious threat to society. Hence, synthetic voice detection has become a pressing research topic, and a large variety of detection methods have recently been proposed. Unfortunately, they hardly generalize to synthetic audio generated by tools never seen in the training phase, which makes them unfit for real-world scenarios. In this work, we aim to overcome this issue by proposing a new detection approach that leverages only the biometric characteristics of the speaker, with no reference to specific manipulations. Since the detector is trained only on real data, generalization is automatically ensured. The proposed approach can be implemented using off-the-shelf speaker verification tools. We test several such solutions on three popular test sets, obtaining good performance, high generalization ability, and high robustness to audio impairment.
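
As a rough illustration of how a detector might be built on speaker verification, the sketch below scores a test utterance against real recordings of the claimed speaker and flags it as synthetic when the identity match is weak. It is a minimal sketch under stated assumptions, not the method evaluated in this work: it assumes that fixed-length speaker embeddings have already been extracted by some off-the-shelf speaker verification model, that a reference set of real recordings of the claimed speaker is available, and that max-pooled cosine similarity with a fixed threshold is a reasonable decision rule. The function names and the threshold value are hypothetical.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def identity_score(test_embedding, reference_embeddings):
    """Highest similarity between the test utterance and any real reference.

    Assumption: embeddings come from an external speaker verification model
    (not shown here); max-pooling over references is an illustrative choice.
    """
    return max(cosine_similarity(test_embedding, ref) for ref in reference_embeddings)


def is_synthetic(test_embedding, reference_embeddings, threshold=0.5):
    """Flag the utterance as synthetic when it does not match the claimed speaker.

    Assumption: the threshold of 0.5 is a placeholder; in practice it would be
    calibrated on real recordings only.
    """
    return identity_score(test_embedding, reference_embeddings) < threshold


# Toy 3-dimensional embeddings standing in for real speaker embeddings.
references = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
matching_voice = [0.95, 0.05, 0.0]
mismatched_voice = [0.0, 0.0, 1.0]

print(is_synthetic(matching_voice, references))
print(is_synthetic(mismatched_voice, references))
```

Because the decision depends only on real recordings of the claimed speaker, nothing in this sketch is tied to a particular synthesis tool, which is the property the approach relies on for generalization.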