WaveCRN: An Efficient Convolutional Recurrent Neural Network for End-to-end Speech Enhancement

Due to the simple design pipeline, end-to-end (E2E) neural models for speech enhancement (SE) have attracted great interest. In order to improve the performance of the E2E model, the locality and temporal sequential properties of speech should be efficiently taken into account when modelling. However, in most current E2E models for SE, these properties are either not fully considered or are too complex to be realized. In this paper, we propose an efficient E2E SE model, termed WaveCRN. In WaveCRN, the speech locality feature is captured by a convolutional neural network (CNN), while the temporal sequential property of the locality feature is modeled by stacked simple recurrent units (SRU). Unlike a conventional temporal sequential model that uses a long short-term memory (LSTM) network, which is difficult to parallelize, SRU can be efficiently parallelized in calculation with even fewer model parameters. In addition, in order to more effectively suppress the noise components in the input noisy speech, we derive a novel restricted feature masking (RFM) approach that performs enhancement on the feature maps in the hidden layers; this is different from the approach that applies the estimated ratio mask on the noisy spectral features, which is commonly used in speech separation methods. Experimental results on speech denoising and compressed speech restoration tasks confirm that with the lightweight architecture of SRU and the feature-mapping-based RFM, WaveCRN performs comparably with other state-of-the-art approaches with notably reduced model complexity and inference time.
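The two ideas named above, a recurrent unit whose heavy computation can run in parallel and a mask applied to hidden feature maps rather than to spectral features, can be illustrated with a small sketch. This is not the authors' implementation. It assumes the commonly cited SRU formulation (forget gate, reset gate, and highway connection computed from the input only), and it assumes that the "restricted" mask is bounded to [-1, 1] by a tanh; neither detail is given in the abstract. The function names, the helper `random_matrix`, the weight shapes, and the toy feature map are hypothetical and chosen only for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, x):
    # Plain matrix-vector product on nested lists.
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def random_matrix(rows, cols, rng):
    # Hypothetical helper: small random weights for the demo.
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def sru_layer(xs, W, Wf, bf, Wr, br):
    """Simplified SRU layer over a sequence of d-dimensional frames.

    Assumed form (not stated in the abstract):
      f_t = sigmoid(Wf x_t + bf)
      r_t = sigmoid(Wr x_t + br)
      c_t = f_t * c_{t-1} + (1 - f_t) * (W x_t)
      h_t = r_t * tanh(c_t) + (1 - r_t) * x_t
    """
    # Step 1: every matrix product depends only on the inputs, so all
    # time steps could be computed at once (the parallelizable part).
    proj = [matvec(W, x) for x in xs]
    forget = [[sigmoid(a + b) for a, b in zip(matvec(Wf, x), bf)] for x in xs]
    reset = [[sigmoid(a + b) for a, b in zip(matvec(Wr, x), br)] for x in xs]

    # Step 2: only a cheap elementwise recurrence remains sequential.
    c = [0.0] * len(xs[0])
    hs = []
    for x, p, f, r in zip(xs, proj, forget, reset):
        c = [fi * ci + (1.0 - fi) * pi for fi, ci, pi in zip(f, c, p)]
        hs.append([ri * math.tanh(ci) + (1.0 - ri) * xi
                   for ri, ci, xi in zip(r, c, x)])
    return hs


def restricted_mask(hs, Wm):
    # Assumption: the mask is restricted to [-1, 1] with a tanh.
    return [[math.tanh(v) for v in matvec(Wm, h)] for h in hs]


def apply_mask(features, mask):
    # Masking acts on the hidden feature map, not on a spectrogram.
    return [[m * f for m, f in zip(mr, fr)] for mr, fr in zip(mask, features)]


# Toy feature map: 4 frames x 3 channels, standing in for the CNN output.
rng = random.Random(0)
d = 3
features = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5], [0.1, 0.2, 0.3], [-2.0, 1.0, 0.0]]
hidden = sru_layer(features,
                   random_matrix(d, d, rng), random_matrix(d, d, rng), [0.0] * d,
                   random_matrix(d, d, rng), [0.0] * d)
mask = restricted_mask(hidden, random_matrix(d, d, rng))
masked = apply_mask(features, mask)
```

In this sketch the only loop that carries state across time is elementwise, which is the property that makes an SRU easier to parallelize than an LSTM, whose gates depend on the previous hidden state through matrix products. Because the assumed mask lies in [-1, 1], it can attenuate or flip the sign of a feature but never amplify it.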
