While significant advancements have been made in the generation of deepfakes using deep learning technologies, their misuse is now a well-known issue. Deepfakes can cause severe security and privacy problems, as they can be used to impersonate a person's identity in a video by replacing his or her face with another person's face. Recently, a new problem of synthesizing a person's voice has emerged, where AI-based deep learning models can clone any person's voice from just a few seconds of audio. With the emerging threat of impersonation attacks using deepfake audio and video, a new generation of deepfake detectors is needed that focuses on both modalities collectively. A large amount of high-quality data is typically required to capture real-world scenarios and develop a competent deepfake detector. However, existing deepfake datasets contain either deepfake videos or deepfake audio, and they also exhibit racial bias. Hence, there is a crucial need for a high-quality dataset covering both video and audio deepfakes, which can be used to detect the two simultaneously. To fill this gap, we propose a novel Audio-Video Deepfake dataset (FakeAVCeleb) that contains not only deepfake videos but also the corresponding synthesized, lip-synced fake audio. We generate this dataset using the currently most popular deepfake generation methods. We selected real YouTube videos of celebrities from four racial backgrounds (Caucasian, Black, East Asian, and South Asian) to build a more realistic multimodal dataset that addresses racial bias and further supports the development of multimodal deepfake detectors. We performed several experiments using state-of-the-art detection methods to evaluate our dataset and demonstrate the challenges and usefulness of our multimodal Audio-Video deepfake dataset.