Multi-stream Convolutional Neural Network with Frequency Selection for Robust Speaker Verification

Speaker verification aims to verify whether an input speech corresponds to the claimed speaker. Conventionally, such systems are deployed in a single-stream setting, in which the feature extractor operates over the full frequency range. In this paper, we hypothesize that a machine can learn enough to perform the classification task when listening to only part of the frequency range rather than all of it, a technique we call frequency selection, and we propose a novel multi-stream Convolutional Neural Network (CNN) framework that uses this technique for speaker verification. The proposed framework accommodates diverse temporal embeddings generated from multiple streams to enhance the robustness of acoustic modeling. To diversify the temporal embeddings, we apply feature augmentation with frequency selection: the full frequency band is manually segmented into several sub-bands, and the feature extractor of each stream selects which sub-bands to use as its target frequency domain. Unlike the conventional single-stream solution, in which each utterance is processed only once, our framework processes it with multiple streams in parallel. The input utterance for each stream is pre-processed by a frequency selector restricted to a specified frequency range and post-processed by mean normalization. The normalized temporal embeddings of all streams are then fed into a pooling layer to generate fused embeddings. We conduct extensive experiments on the VoxCeleb dataset, and the results demonstrate that the multi-stream CNN significantly outperforms the single-stream baseline, with a 20.53% relative improvement in minimum Decision Cost Function (minDCF).
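The data flow described in the abstract (sub-band segmentation, per-stream frequency selection, mean normalization, and pooled fusion) can be illustrated with a minimal sketch. It is not the authors' implementation: all function names are hypothetical, `toy_extractor` is a fixed 3-frame moving average standing in for each stream's CNN, and the choices of equal-width sub-bands, per-frame mean normalization, concatenation of streams, and average pooling over time are assumptions, since the abstract does not specify them.

```python
from statistics import fmean


def split_subbands(num_bins, num_subbands):
    """Partition bin indices 0..num_bins-1 into contiguous sub-bands."""
    base, extra = divmod(num_bins, num_subbands)
    bands, start = [], 0
    for i in range(num_subbands):
        size = base + (1 if i < extra else 0)
        bands.append(list(range(start, start + size)))
        start += size
    return bands


def select_frequencies(frames, bands, chosen):
    """Frequency selector: keep only the bins of the chosen sub-bands."""
    keep = [b for idx in chosen for b in bands[idx]]
    return [[frame[b] for b in keep] for frame in frames]


def toy_extractor(frames):
    """Placeholder for a stream's CNN (assumption): 3-frame moving average."""
    return [
        [fmean(col) for col in zip(frames[t - 1], frames[t], frames[t + 1])]
        for t in range(1, len(frames) - 1)
    ]


def mean_normalize(embeddings):
    """Subtract each frame's own mean (the axis is an assumption here)."""
    out = []
    for frame in embeddings:
        mu = fmean(frame)
        out.append([x - mu for x in frame])
    return out


def fuse(stream_embeddings):
    """Concatenate streams frame by frame, then average-pool over time."""
    joined = [sum(parts, []) for parts in zip(*stream_embeddings)]
    return [fmean(col) for col in zip(*joined)]


def multi_stream_embed(frames, stream_choices, num_subbands=4):
    """Run every stream on its own sub-band selection and fuse the results."""
    bands = split_subbands(len(frames[0]), num_subbands)
    streams = [
        mean_normalize(toy_extractor(select_frequencies(frames, bands, chosen)))
        for chosen in stream_choices
    ]
    return fuse(streams)


# Toy input: 5 frames x 8 frequency bins; three streams, two sub-bands each.
frames = [[float(t + f * f) for f in range(8)] for t in range(5)]
fused = multi_stream_embed(frames, [[0, 1], [2, 3], [0, 3]])
print(len(fused))  # 12 values: 3 streams x 4 selected bins each
```

Each stream sees the same utterance but a different subset of sub-bands, so the fused vector combines views that a single full-band extractor would not produce separately. In a real system the placeholder extractor would be replaced by a trained CNN per stream.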
