With the rapid development of deep learning techniques, the generation and counterfeiting of multimedia material are becoming increasingly straightforward to perform. At the same time, sharing fake content on the web has become so simple that malicious users can create unpleasant situations with minimal effort. Moreover, forged media are getting more and more complex, with manipulated videos taking the scene over from still images. The multimedia forensic community has addressed the possible threats that this situation poses by developing detectors that verify the authenticity of multimedia objects. However, the vast majority of these tools only analyze one modality at a time. This was not a problem as long as still images were the most widely edited media, but now that manipulated videos are becoming customary, performing monomodal analyses can be reductive. Nonetheless, multimodal detectors are still lacking in the literature, mainly due to the scarcity of datasets containing forged multimodal data on which to train and test the designed algorithms. In this paper, we focus on the generation of an audio-visual deepfake dataset. First, we present a general pipeline for synthesizing speech deepfake content from a given real or fake video, facilitating the creation of counterfeit multimodal material. The proposed method uses Text-to-Speech (TTS) and Dynamic Time Warping (DTW) techniques to achieve realistic speech tracks. Then, we use this pipeline to generate and release TIMIT-TTS, a synthetic speech dataset produced with the most cutting-edge methods in the TTS field. It can be used as a standalone audio dataset or combined with other state-of-the-art datasets to perform multimodal research. Finally, we present numerous experiments to benchmark the proposed dataset in both monomodal and multimodal conditions, showing the need for multimodal forensic detectors and more suitable data.
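To illustrate the kind of alignment step the abstract refers to, the following is a minimal sketch, not the authors' actual pipeline: it aligns a TTS-generated speech track to the original audio of a video using MFCC features and DTW, then applies a crude global time-stretch so the synthetic track matches the original duration. The file names, sample rate, and the global-stretch simplification are assumptions made for illustration only.

```python
# Illustrative sketch (assumed setup, not the paper's exact method):
# align a TTS-generated speech track to the original speech of a video
# using MFCC features and Dynamic Time Warping (DTW).
import librosa
import soundfile as sf

SR = 16000  # assumed sample rate for speech tracks

# Load the original speech (extracted from the video) and the TTS output
orig, _ = librosa.load("original_speech.wav", sr=SR)    # hypothetical file
synth, _ = librosa.load("tts_output.wav", sr=SR)        # hypothetical file

# Compute MFCCs as alignment features for both tracks
mfcc_orig = librosa.feature.mfcc(y=orig, sr=SR, n_mfcc=13)
mfcc_synth = librosa.feature.mfcc(y=synth, sr=SR, n_mfcc=13)

# DTW between the two feature sequences; wp is the optimal warping path
_, wp = librosa.sequence.dtw(X=mfcc_orig, Y=mfcc_synth, metric="euclidean")

# Simplification: stretch the synthetic track globally so it spans the same
# duration as the original; a full pipeline would warp locally along `wp`.
rate = len(synth) / len(orig)
aligned = librosa.effects.time_stretch(synth, rate=rate)

sf.write("tts_aligned.wav", aligned, SR)
```

In practice, a local warping along the DTW path (rather than a single global stretch) would preserve lip synchronization more faithfully; the global stretch is used here only to keep the sketch short.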