We address voice activity detection in acoustic environments containing both transient and stationary noises, which commonly occur in real-life scenarios. We exploit the distinct spatial patterns of speech and non-speech audio frames by independently learning the underlying geometric structure of each class. This is done with a deep encoder-decoder neural network: the encoder maps spectral features with temporal context to low-dimensional representations generated by the diffusion maps method, and the decoder maps the embedded data back into the high-dimensional space. Concatenating the decoder to the encoder yields a deep neural network, resembling the known Diffusion nets architecture, which is trained to separate speech from non-speech frames. Experimental results show improved performance compared to competing voice activity detection methods, in accuracy, robustness, and generalization ability. Our model runs in real time and can be integrated into audio-based communication systems. We also present a batch algorithm that achieves even higher accuracy for offline applications.
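To make the embedding step concrete, the following is a minimal NumPy sketch of the diffusion maps method used to produce the encoder's low-dimensional targets. The function name, the Gaussian kernel choice, and the parameters (`eps`, `dim`, `t`) are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def diffusion_maps(X, eps=1.0, dim=2, t=1):
    """Embed rows of X (n_samples x n_features) into `dim` diffusion coordinates.

    Illustrative sketch: Gaussian affinity kernel, row-normalized into a
    Markov matrix, then the leading non-trivial eigenvectors scaled by
    their eigenvalues raised to the diffusion time t.
    """
    # Pairwise squared Euclidean distances between samples
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-D2 / eps)                  # Gaussian affinity kernel
    P = K / K.sum(axis=1, keepdims=True)   # row-stochastic Markov matrix
    vals, vecs = np.linalg.eig(P)
    order = np.argsort(-vals.real)         # sort eigenvalues descending
    vals, vecs = vals.real[order], vecs.real[:, order]
    # Skip the trivial constant eigenvector (eigenvalue 1);
    # scale the next `dim` eigenvectors by lambda^t
    return vecs[:, 1:dim + 1] * (vals[1:dim + 1] ** t)

# Toy usage: embed 20 random 3-D "frames" into 2 diffusion coordinates
X = np.random.RandomState(0).randn(20, 3)
embedding = diffusion_maps(X)  # shape (20, 2)
```

In the full architecture described above, such embeddings would be computed separately for speech and non-speech training frames, and the encoder is trained to regress them from the spectral features.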