The wide deployment of speech-based biometric systems usually demands high-performance speaker recognition algorithms. However, most prior work on speaker recognition processes speech in either the frequency domain or the time domain alone, which may produce suboptimal results because both domains carry information important for speaker recognition. In this paper, we analyze the speech signal in both the time and frequency domains and propose the time-frequency network~(TFN) for speaker recognition, which extracts and fuses features from the two domains. Building on recent advances in deep neural networks, we design a convolutional neural network that encodes the raw speech waveform and the frequency spectrum into domain-specific features, which are then fused and transformed into a classification feature space for speaker recognition. Experimental results on the publicly available TIMIT and LibriSpeech datasets show that our framework effectively combines the information from the two domains and outperforms state-of-the-art methods for speaker recognition.
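The two-branch encode-then-fuse idea described above can be illustrated with a minimal NumPy sketch. This is not the authors' TFN implementation: the kernel sizes, number of filters, pooling, and the 10-speaker output layer are all hypothetical placeholders, and random kernels stand in for learned convolutional weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d_features(x, kernels):
    # Convolve the signal with each kernel and global-average-pool,
    # yielding one scalar feature per kernel (a stand-in for a conv branch).
    return np.array([np.convolve(x, k, mode="valid").mean() for k in kernels])

# Toy 1-second waveform at 16 kHz (stand-in for a real utterance).
waveform = rng.standard_normal(16000)
# Magnitude spectrum as the frequency-domain view of the same utterance.
spectrum = np.abs(np.fft.rfft(waveform))

# Random kernels stand in for learned convolutional filters.
time_kernels = [rng.standard_normal(64) for _ in range(8)]
freq_kernels = [rng.standard_normal(64) for _ in range(8)]

time_feat = conv1d_features(waveform, time_kernels)   # time-domain branch
freq_feat = conv1d_features(spectrum, freq_kernels)   # frequency-domain branch

# Fuse the two domain-specific embeddings and project them into a
# classification space (10 hypothetical speakers).
fused = np.concatenate([time_feat, freq_feat])
W = rng.standard_normal((10, fused.size))
logits = W @ fused
predicted_speaker = int(np.argmax(logits))
```

In the actual TFN, each branch would be a trained deep convolutional encoder and the fusion/projection layers would be learned jointly; this sketch only shows how the two domain views are encoded separately and combined before classification.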