Diverse promising datasets have been designed to foster the development of fake audio detection, such as the ASVspoof databases. However, previous datasets ignore an attack scenario in which a hacker hides small fake clips inside real speech audio. This poses a serious threat, since it is difficult to distinguish a small fake clip from the whole speech utterance. Therefore, this paper develops such a dataset for half-truth audio detection (HAD). Partially fake audio in the HAD dataset involves changing only a few words in an utterance. The audio for these words is generated with the latest state-of-the-art speech synthesis technology. Using this dataset, we can not only detect fake utterances but also localize the manipulated regions within an utterance. Some benchmark results are presented on this dataset. The results show that partially fake audio is much more challenging for fake audio detection than fully fake audio.
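As a rough illustration of the attack scenario described above (not taken from the paper), the following Python sketch splices a synthesized word segment into a genuine recording and keeps sample-level labels marking the manipulated region. The file names, word timestamps, and use of the soundfile library are illustrative assumptions, not part of the HAD construction pipeline.

```python
# Minimal sketch: build a partially fake utterance by replacing one word of a
# real recording with a synthesized segment. All paths and timestamps are
# placeholders; the TTS step is assumed to have produced the fake word offline.
import numpy as np
import soundfile as sf

# Load the genuine utterance (mono waveform and its sample rate).
real, sr = sf.read("real_utterance.wav")

# Load the synthesized replacement for one word (any TTS system could produce it).
fake_word, sr_fake = sf.read("synthesized_word.wav")
assert sr == sr_fake, "resample the fake segment to match the real audio"

# Word boundaries (seconds) of the region to replace, e.g. from forced alignment
# of the transcript; these values are purely illustrative.
start_s, end_s = 1.20, 1.65
start, end = int(start_s * sr), int(end_s * sr)

# Splice: keep the real audio outside the word, insert the fake audio inside.
partially_fake = np.concatenate([real[:start], fake_word, real[end:]])

# Sample-level labels for localization: 0 = real, 1 = manipulated region.
labels = np.concatenate([
    np.zeros(start, dtype=np.int8),
    np.ones(len(fake_word), dtype=np.int8),
    np.zeros(len(real) - end, dtype=np.int8),
])

sf.write("partially_fake_utterance.wav", partially_fake, sr)
```

The key point this sketch conveys is that only a short span of the waveform differs from genuine speech, which is why detection and localization are harder than for fully synthesized utterances.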