Attackers may manipulate audio with the intent of presenting falsified reports, altering the perceived opinions of a public figure, or gaining influence and power. As inauthentic multimedia becomes increasingly prevalent, it is imperative to develop tools that can determine the legitimacy of media. We present a method that analyzes audio signals to determine whether they contain real human voices or fake human voices (i.e., voices generated by neural acoustic and waveform models). Rather than analyzing the audio signals directly, the proposed approach converts them into spectrogram images that capture frequency, intensity, and temporal content, and evaluates these images with a Convolutional Neural Network (CNN). Trained on both genuine and synthesized voice signals, our approach achieves high accuracy on this classification task.
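The core preprocessing step described above, converting a raw audio signal into a spectrogram image of frequency, intensity, and time, can be sketched as follows. This is a minimal illustration using a plain framed-FFT magnitude spectrogram in NumPy; the frame length, hop size, and log scaling are assumptions for illustration, not the paper's actual parameters, and the resulting array is what would be fed to a CNN classifier.

```python
import numpy as np

def spectrogram(signal, frame_len=512, hop=256):
    """Compute a log-magnitude spectrogram (frequency bins x time frames)."""
    window = np.hanning(frame_len)          # taper each frame to reduce spectral leakage
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft yields frame_len // 2 + 1 frequency bins per frame
    spec = np.abs(np.fft.rfft(frames, axis=1)).T   # shape: (freq_bins, n_frames)
    # log scale roughly matches perceived intensity; epsilon avoids log(0)
    return np.log(spec + 1e-8)

# Example input: a one-second 440 Hz tone sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
S = spectrogram(tone)
print(S.shape)   # (freq_bins, n_frames) image passed to the CNN
```

The 2-D array `S` plays the role of an image: each column is one time frame and each row one frequency bin, so a standard image-classification CNN can be applied to it unchanged.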