Freely available and easy-to-use audio editing tools make it straightforward to perform audio splicing. Convincing forgeries can be created by combining various speech samples from the same person. Detection of such splices is important both in the public sector when considering misinformation, and in a legal context to verify the integrity of evidence. Unfortunately, most existing detection algorithms for audio splicing use handcrafted features and make specific assumptions. However, criminal investigators are often faced with audio samples from unconstrained sources with unknown characteristics, which raises the need for more generally applicable methods. With this work, we aim to take a first step towards unconstrained audio splicing detection to address this need. We simulate various attack scenarios in the form of post-processing operations that may disguise splicing. We propose a Transformer sequence-to-sequence (seq2seq) network for splicing detection and localization. Our extensive evaluation shows that the proposed method outperforms existing dedicated approaches for splicing detection [3, 10] as well as the general-purpose networks EfficientNet [28] and RegNet [25].