Freely available, easy-to-use audio editing tools make audio splicing straightforward. Convincing forgeries can be created by combining various speech samples from the same person. Detecting such splices is important both in the public sector, when considering misinformation, and in a legal context, to verify the integrity of evidence. Unfortunately, most existing detection algorithms for audio splicing rely on handcrafted features and make specific assumptions. Criminal investigators, however, often face audio samples from unconstrained sources with unknown characteristics, which raises the need for more generally applicable methods. With this work, we take a first step towards unconstrained audio splicing detection to address this need. We simulate various attack scenarios in the form of post-processing operations that may disguise splicing. We propose a Transformer sequence-to-sequence (seq2seq) network for splicing detection and localization. Our extensive evaluation shows that the proposed method outperforms existing dedicated approaches for splicing detection [3, 10] as well as the general-purpose networks EfficientNet [28] and RegNet [25].
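To make the simulated attack scenario concrete, the following is a minimal sketch of how a spliced signal might be generated and then disguised by a post-processing operation. It is an illustration under stated assumptions, not the pipeline used in this work: the helper names `splice` and `add_noise`, the synthetic tones standing in for speech, the 16 kHz sampling rate, and the choice of additive white Gaussian noise as the post-processing step are all hypothetical.

```python
import numpy as np


def splice(first, second, cut_first, cut_second):
    """Join first[:cut_first] with second[cut_second:].

    Returns the spliced signal and the sample index where the splice occurs.
    (Hypothetical helper, not taken from the paper.)
    """
    spliced = np.concatenate([first[:cut_first], second[cut_second:]])
    return spliced, cut_first


def add_noise(signal, snr_db, rng):
    """Add white Gaussian noise so the result has roughly the given SNR in dB.

    (Assumed example of a post-processing operation that may mask a splice.)
    """
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise


rng = np.random.default_rng(0)
sample_rate = 16_000  # assumed sampling rate in Hz
t = np.arange(sample_rate) / sample_rate

# Two synthetic one-second tones stand in for speech recordings.
recording_a = np.sin(2 * np.pi * 220 * t)
recording_b = np.sin(2 * np.pi * 330 * t)

# Splice the first 6000 samples of A onto the tail of B, then disguise it.
spliced, splice_index = splice(recording_a, recording_b, 6_000, 10_000)
disguised = add_noise(spliced, snr_db=20.0, rng=rng)
```

In this sketch, `splice_index` serves as the ground-truth location that a localization model would be trained to predict, while `disguised` is the input it would actually see.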