Voice activity detection (VAD) based on deep belief networks (DBNs) has recently been proposed. It fuses the advantages of multiple features effectively and achieves state-of-the-art performance. However, the deep layers of the DBN-based VAD do not show a clear advantage over the shallower layers. In this paper, we propose a VAD based on a denoising deep neural network (DDNN) to address this problem. Specifically, we pre-train a deep neural network in a special unsupervised, denoising, greedy layer-wise mode, and then fine-tune the whole network in a supervised way with the standard back-propagation algorithm. In the pre-training phase, we take the noisy speech signals as the visible layer and try to extract a new feature that minimizes the reconstruction cross-entropy loss between the noisy speech signals and their corresponding clean speech signals. Experimental results show that the proposed DDNN-based VAD not only outperforms the DBN-based VAD but also shows a clear performance improvement of the deep layers over the shallower layers.
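The pre-training idea described above can be illustrated with a small sketch. This is not the authors' implementation: the function names (`pretrain_denoising_layer`, `pretrain_stack`), the synthetic features, the layer sizes, the learning rate, and the use of plain full-batch gradient descent are all assumptions made for illustration. The abstract also does not say what the reconstruction target of the upper layers is; the sketch assumes it is the clean signal encoded by the layers already trained. Supervised fine-tuning is not shown.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(target, recon, eps=1e-12):
    # Mean over examples of the summed per-feature binary cross-entropy.
    recon = np.clip(recon, eps, 1.0 - eps)
    per_example = -np.sum(target * np.log(recon)
                          + (1.0 - target) * np.log(1.0 - recon), axis=1)
    return float(np.mean(per_example))

def pretrain_denoising_layer(noisy, clean, n_hidden, epochs=200, lr=0.5, seed=0):
    """Train one layer to reconstruct `clean` from `noisy` (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n, n_in = noisy.shape
    n_out = clean.shape[1]
    W1 = rng.normal(0.0, 0.1, size=(n_in, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.1, size=(n_hidden, n_out))
    b2 = np.zeros(n_out)
    losses = []
    for _ in range(epochs):
        hidden = sigmoid(noisy @ W1 + b1)   # new feature from the noisy input
        recon = sigmoid(hidden @ W2 + b2)   # attempted reconstruction of the clean input
        losses.append(cross_entropy(clean, recon))
        # For a sigmoid output with cross-entropy loss, the gradient with
        # respect to the output pre-activation is (recon - target) / n.
        d_out = (recon - clean) / n
        d_hid = (d_out @ W2.T) * hidden * (1.0 - hidden)
        W2 -= lr * (hidden.T @ d_out)
        b2 -= lr * d_out.sum(axis=0)
        W1 -= lr * (noisy.T @ d_hid)
        b1 -= lr * d_hid.sum(axis=0)
    return (W1, b1), losses

def pretrain_stack(noisy, clean, layer_sizes, **kwargs):
    """Greedy layer-wise pre-training: train one layer, freeze it, move up."""
    encoders, histories = [], []
    x_noisy, x_clean = noisy, clean
    for n_hidden in layer_sizes:
        (W, b), losses = pretrain_denoising_layer(x_noisy, x_clean, n_hidden, **kwargs)
        encoders.append((W, b))
        histories.append(losses)
        # Assumption: the next layer sees the encoded noisy features and
        # targets the clean features passed through the same encoder.
        x_noisy = sigmoid(x_noisy @ W + b)
        x_clean = sigmoid(x_clean @ W + b)
    return encoders, histories

# Synthetic stand-in for acoustic features scaled to [0, 1]; not real speech data.
rng = np.random.default_rng(0)
latent = rng.normal(size=(256, 3))
clean = sigmoid(latent @ rng.normal(scale=2.0, size=(3, 8)))
noisy = np.clip(clean + 0.1 * rng.normal(size=clean.shape), 0.0, 1.0)

encoders, histories = pretrain_stack(noisy, clean, layer_sizes=[6, 4])
print("layer 1 loss: first epoch", histories[0][0], "last epoch", histories[0][-1])
```

The returned `encoders` list holds one weight matrix and bias vector per layer. In a full pipeline these would initialize a deep network that is then fine-tuned with back-propagation on speech/non-speech labels.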