Human Perception of Audio Deepfakes

The recent emergence of deepfakes, realistic computer-generated multimedia fakes, has brought the detection of manipulated and generated content to the forefront. While many machine learning models for deepfake detection have been proposed, human detection capabilities remain far less explored. This is of special importance because human perception differs from machine perception and deepfakes are generally designed to fool humans. So far, this issue has only been addressed in the area of images and video. To compare the ability of humans and machines to detect audio deepfakes, we conducted an online gamified experiment in which we asked users to distinguish bona fide audio samples from spoofed audio generated with a variety of algorithms. In total, 200 users competed over 8976 game rounds against an artificial intelligence (AI) algorithm trained for audio deepfake detection. From the collected data, we found that the machine generally outperforms humans in detecting audio deepfakes, but that the converse holds for a certain attack type, for which humans are still more accurate. Furthermore, we found that younger participants are on average better at detecting audio deepfakes than older participants, while IT professionals hold no advantage over laypeople. We conclude that it is important to combine human and machine knowledge in order to improve audio deepfake detection.
