Automatic speaker recognition algorithms typically characterize speech audio using physiological speech characteristics encoded in short-term spectral features. Such algorithms do not capitalize on the complementary, discriminative speaker-dependent characteristics present in behavioral speech features. In this work, we propose a prosody encoding network called DeepTalk for extracting vocal style features directly from raw audio data. DeepTalk outperforms several state-of-the-art speaker recognition systems based on physiological speech characteristics across multiple challenging datasets. Speaker recognition performance improves further when DeepTalk is combined with a state-of-the-art speaker recognition system based on physiological speech features. We also integrate DeepTalk into a current state-of-the-art speech synthesizer to generate synthetic speech. A detailed analysis of the synthetic speech shows that DeepTalk captures F0 contours essential for vocal style modeling. Furthermore, DeepTalk-based synthetic speech is shown to be almost indistinguishable from real speech in the context of speaker recognition.