Speech dereverberation is an important stage in many speech technology applications. Recent work in this area has been dominated by deep neural network models. Temporal convolutional networks (TCNs) are deep learning models that have been proposed for sequence modelling in the task of dereverberating speech. In this work, a weighted multi-dilation depthwise-separable convolution is proposed to replace the standard depthwise-separable convolutions in TCN models. The proposed convolution enables the TCN to dynamically focus on more or less local information in its receptive field at each convolutional block in the network. It is shown that this weighted multi-dilation temporal convolutional network (WD-TCN) consistently outperforms the TCN across various model configurations, and that using the WD-TCN is a more parameter-efficient way to improve performance than increasing the number of convolutional blocks. The best improvement over the baseline TCN is 0.55 dB scale-invariant signal-to-distortion ratio (SISDR), and the best-performing WD-TCN model attains 12.26 dB SISDR on the WHAMR dataset.
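The core idea can be illustrated with a minimal numpy sketch: a depthwise convolution is applied at several dilation factors, a softmax-gated weight per dilation branch is computed from a global summary of the input (a squeeze-and-excite-style gate), and the weighted sum is followed by a pointwise convolution. The function and parameter names here are hypothetical illustrations, not the paper's actual implementation.

```python
import numpy as np

def depthwise_conv1d(x, w, dilation):
    """Per-channel 1-D convolution with 'same' padding.
    x: (channels, time), w: (channels, kernel_size)."""
    C, T = x.shape
    K = w.shape[1]
    pad = dilation * (K - 1) // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    out = np.zeros((C, T))
    for k in range(K):
        # each kernel tap reads the input shifted by k * dilation samples
        out += w[:, k:k + 1] * xp[:, k * dilation : k * dilation + T]
    return out

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def weighted_multi_dilation_dsconv(x, dw_kernels, dilations, attn_w, pw):
    """Sketch of a weighted multi-dilation depthwise-separable convolution.
    dw_kernels: list of (C, K) depthwise kernels, one per dilation factor.
    attn_w: (num_dilations, C) hypothetical gating matrix.
    pw: (C_out, C) pointwise (1x1) convolution weights."""
    # global average pooling summarises the input for the gate
    g = x.mean(axis=1)                      # (C,)
    a = softmax(attn_w @ g)                 # one weight per dilation branch
    # weighted sum over dilation branches lets the block trade off
    # local (small dilation) vs. wider (large dilation) context
    y = sum(a[i] * depthwise_conv1d(x, dw_kernels[i], d)
            for i, d in enumerate(dilations))
    return pw @ y                           # pointwise conv mixes channels

# usage: 4 channels, 8 time steps, kernel size 3, two dilation branches
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
kers = [rng.standard_normal((4, 3)) for _ in range(2)]
attn = rng.standard_normal((2, 4))
pw = rng.standard_normal((4, 4))
out = weighted_multi_dilation_dsconv(x, kers, [1, 2], attn, pw)
```

With all branch weights fixed to a one-hot vector, this reduces to a standard depthwise-separable convolution at a single dilation, which is the baseline TCN block it replaces.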