Several datasets already exist for fake audio detection, such as the ASVspoof and ADD datasets. However, these databases do not cover the scenario in which the emotion of an utterance is changed from one state to another while other information (e.g., speaker identity and linguistic content) remains the same. Changing the emotion often changes the semantics of the speech, which can pose a serious threat to social stability. Therefore, this paper reports our progress in developing an emotion fake audio detection dataset, named EmoFake, in which the emotional state of the original audio is changed. The fake audio in EmoFake is generated with state-of-the-art emotional voice conversion models. Some benchmark experiments are conducted on this dataset. The results show that EmoFake poses a challenge to the LCNN and RawNet2 baseline models of ASVspoof 2021.