Speech synthesis methods can create realistic-sounding speech, which may be used for fraud, spoofing, and misinformation campaigns. Forensic methods that detect synthesized speech are important for protection against such attacks. Forensic attribution methods go further: they identify the specific speech synthesis method (i.e., the speech synthesizer) used to create a speech signal. Because the number of realistic-sounding speech synthesizers keeps growing, we propose a speech attribution method that generalizes to new synthesizers not seen during training. To do so, we investigate speech synthesizer attribution in both a closed-set scenario and an open-set scenario. In other words, we consider some speech synthesizers to be "known" synthesizers (i.e., part of the closed set) and others to be "unknown" synthesizers (i.e., part of the open set). We represent speech signals as spectrograms and train our proposed method, the compact attribution transformer (CAT), on the closed set for multi-class classification. Then, we extend our analysis to the open set to attribute synthesized speech signals to both known and unknown synthesizers. We apply t-distributed stochastic neighbor embedding (t-SNE) to the latent space of the trained CAT to differentiate between the unknown synthesizers. Additionally, we explore poly-1 loss formulations to improve attribution results. Our proposed approach successfully attributes synthesized speech signals to their respective speech synthesizers in both closed-set and open-set scenarios.
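The abstract mentions poly-1 loss formulations without giving details. As a rough illustration only, the sketch below uses the commonly cited Poly-1 form of cross-entropy, which adds a term epsilon * (1 - p_t) to the standard loss, where p_t is the softmax probability of the true class. The function name `poly1_cross_entropy` and the default `epsilon=1.0` are assumptions made for this example, not the authors' implementation or settings.

```python
import math


def poly1_cross_entropy(logits, target, epsilon=1.0):
    """Poly-1 cross-entropy for one example (illustrative sketch).

    logits  : list of raw class scores, one per known synthesizer
    target  : index of the true class
    epsilon : weight of the extra (1 - p_t) term; 1.0 is an assumed value
    """
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))

    # Log-probability and probability of the true class.
    log_p_t = logits[target] - log_sum_exp
    p_t = math.exp(log_p_t)

    # Standard cross-entropy plus the first-order polynomial term.
    return -log_p_t + epsilon * (1.0 - p_t)


if __name__ == "__main__":
    scores = [2.0, 0.5, -1.0]  # hypothetical logits for three synthesizers
    print(poly1_cross_entropy(scores, target=0))
```

With `epsilon=0.0` the function reduces to ordinary cross-entropy, so the extra term can be read as a tunable adjustment on top of the usual multi-class classification objective.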