Recent developments in technology have rewarded us with remarkable audio synthesis models such as Tacotron and WaveNet. On the other hand, these advances pose serious threats, such as speech clones and deep fakes, that may go undetected. To tackle these alarming situations, there is an urgent need for models that can discriminate synthesized speech from genuine human speech and also identify the source of such synthesis. Here, we propose a model based on a Convolutional Neural Network (CNN) and a Bidirectional Recurrent Neural Network (BiRNN) that achieves both of the aforementioned objectives. The temporal dependencies present in AI-synthesized speech are exploited using the CNN and Bidirectional RNN. The model outperforms state-of-the-art approaches, classifying AI-synthesized audio versus real human speech with an error rate of 1.9% and detecting the underlying synthesis architecture with an accuracy of 97%.
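To make the described pipeline concrete, the sketch below shows one way a CNN front-end can feed frame-level features into a bidirectional RNN for this kind of classification. The mel-spectrogram input format, layer sizes, GRU choice, and class counts are illustrative assumptions, not the paper's reported configuration.

```python
import torch
import torch.nn as nn

class CNNBiRNN(nn.Module):
    """Illustrative CNN + BiRNN classifier for synthesized-speech detection.
    All dimensions here are assumptions made for the sketch."""

    def __init__(self, n_mels=80, n_classes=2, hidden=128):
        super().__init__()
        # CNN front-end: extracts local spectro-temporal features.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d((2, 2)),   # halve frequency and time resolution
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d((2, 2)),
        )
        # Bidirectional RNN: models temporal dependencies across frames.
        self.birnn = nn.GRU(input_size=64 * (n_mels // 4), hidden_size=hidden,
                            batch_first=True, bidirectional=True)
        # Head: 2 classes for real-vs-synthetic, or more classes when
        # identifying the generating architecture.
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):
        # x: (batch, 1, n_mels, time) mel-spectrogram
        feats = self.cnn(x)                      # (batch, 64, n_mels/4, time/4)
        b, c, f, t = feats.shape
        feats = feats.permute(0, 3, 1, 2).reshape(b, t, c * f)  # frame sequence
        out, _ = self.birnn(feats)               # (batch, time/4, 2*hidden)
        return self.fc(out[:, -1, :])            # logits from the last frame

# Usage example with random data (4 clips, 80 mel bins, 200 frames):
model = CNNBiRNN(n_classes=2)
logits = model(torch.randn(4, 1, 80, 200))
```

The same skeleton covers both objectives in the abstract: with `n_classes=2` it acts as the real-versus-synthetic detector, and with one class per known synthesis system it serves as the architecture identifier.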