Automatic speaker recognition algorithms typically characterize speech audio using short-term spectral features that encode the physiological and anatomical aspects of speech production. Such algorithms do not fully capitalize on speaker-dependent characteristics present in behavioral speech features. In this work, we propose a prosody encoding network called DeepTalk for extracting vocal style features directly from raw audio data. The DeepTalk method outperforms several state-of-the-art speaker recognition systems across multiple challenging datasets. Speaker recognition performance is further improved by combining DeepTalk with a state-of-the-art physiological speech feature-based speaker recognition system. We also integrate DeepTalk into a current state-of-the-art speech synthesizer to generate synthetic speech. A detailed analysis of the synthetic speech shows that DeepTalk captures F0 contours essential for vocal style modeling. Furthermore, DeepTalk-based synthetic speech is shown to be almost indistinguishable from real speech in the context of speaker recognition.
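To make the distinction drawn above concrete, the following is a minimal illustrative sketch, not part of the paper's implementation, contrasting the two feature families: short-term spectral features (here MFCCs, a common stand-in for physiological vocal-tract characteristics) and the F0 contour (a prosodic, behavioral feature of the kind DeepTalk is designed to model). It assumes the librosa library and a placeholder audio file path.

```python
# Hypothetical sketch using librosa; the file path and parameter choices
# are illustrative assumptions, not the paper's actual pipeline.
import librosa
import numpy as np

# Load a speech clip at 16 kHz; "speech.wav" is a placeholder path.
y, sr = librosa.load("speech.wav", sr=16000)

# Short-term spectral features (MFCCs): frame-level descriptors of the
# vocal-tract spectrum, i.e., the physiological side of speech production.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)  # shape: (20, n_frames)

# F0 contour: the fundamental-frequency trajectory over time, a prosodic
# (behavioral) cue; unvoiced frames are returned as NaN.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)

print(mfcc.shape)      # spectral features per frame
print(np.nanmean(f0))  # mean F0 over voiced frames
```

A conventional speaker recognition system would embed the MFCC-like frame sequence, whereas a prosody encoder such as DeepTalk targets the style information carried by trajectories like the F0 contour, which is why the two are complementary when fused.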