FakeAVCeleb: A Novel Audio-Video Multimodal Deepfake Dataset

Deep learning has driven significant advances in the generation of forged video and audio, commonly known as deepfakes, and their misuse is now a well-known problem. More recently, a new problem has emerged: generating a cloned or synthesized voice of a person. AI-based deep learning models can synthesize any person's voice from just a few seconds of audio. With the emerging threat of impersonation attacks using deepfake video and audio, new deepfake detectors are needed that focus on both video and audio. Detecting deepfakes is a challenging task, and researchers have made numerous attempts and proposed several deepfake detection methods. Developing a good deepfake detector requires a large amount of high-quality data that captures real-world scenarios. Many researchers have contributed to this cause and provided several deepfake datasets, both self-generated and in-the-wild. However, almost all of these datasets contain either deepfake videos or deepfake audio, but not both. Moreover, recently proposed deepfake datasets have racial bias issues. Hence, there is a crucial need for a good dataset that contains both deepfake video and deepfake audio. To fill this gap, we propose a novel audio-video deepfake dataset (FakeAVCeleb) that contains not only deepfake videos but also the respective synthesized cloned audio. We generated our dataset using the most popular recent deepfake generation methods, and the videos and audio are perfectly lip-synced with each other. To generate a more realistic dataset, we selected real YouTube videos of celebrities from four racial backgrounds (Caucasian, Black, East Asian, and South Asian) to counter the racial bias issue. Lastly, we propose a novel multimodal detection method that detects deepfake videos and audio based on our multimodal audio-video deepfake dataset.
